scrubbbbs edited this page Feb 11, 2025 · 26 revisions

Welcome to the cbird wiki! Just random stuff for now

Indexing

cbird indexing is based on a single top-level directory, and stores information using relative paths. This prevents operations that would otherwise break the entire index, which matters because the index can take quite a while to compute:

  • It is not possible to break the index by renaming/copying/moving the top-level directory
  • The index can be shared on the network without causing any problems

However this design comes with a few issues:

  • Paths containing symlinks cannot always be simplified, as they could point outside of the index. This will result in unwanted duplicates
  • -use is always needed when working in a sub-directory (the default directory is CWD). This is less annoying with -use @, which searches up the parent tree for the first index it finds.
  • Windows doesn't have a single-root filesystem, so this can be a challenge there. The mklink program can be used to link all content into a top-level directory of your choosing; then you can use -i.links true
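
The parent-tree lookup that -use @ performs can be pictured with a small shell sketch. This is not cbird's code: find_index_root is a hypothetical helper, and it assumes the index lives in an _index directory directly under the top-level directory.

```shell
# Sketch only: walk up from a starting directory (default: CWD) until a
# directory containing "_index" is found, then print that directory.
find_index_root() {
    dir=$(cd "${1:-.}" && pwd) || return 1
    while :; do
        if [ -d "$dir/_index" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
        # Reached the filesystem root without finding an index.
        [ "$dir" = "/" ] && return 1
        dir=$(dirname "$dir")
    done
}
```

Called from a sub-directory, find_index_root prints the first parent that has an index, or fails if there is none.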

Index options

Options to the indexer are referred to as "index params", which are set with -i.<name> <value>. You can get a list of names with -h or -list-index-params, which will also give the current value and a short description.

There are a couple of gotchas with index params to be aware of:

  • index params are not saved to disk and must be specified every time you run cbird. The best way to work around this is to make a command alias or script to invoke cbird so this remains a constant
  • index options only apply to the -update operation
  • index options apply to the path scanned for changes; this is either the path given to -use or the optional path given to -update
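
Since index params are not saved, one way to keep them constant is a small wrapper. A minimal sketch, assuming a POSIX shell: cbird_photos is a made-up name, and the params shown are only examples taken from this page, so substitute your own.

```shell
# Hypothetical wrapper: always passes the same index params, followed by
# whatever arguments the caller adds (for example -update).
cbird_photos() {
    cbird -use @ -i.links true -i.algos dct+fdct "$@"
}
```

With this in your shell profile, cbird_photos -update runs the indexer with the same params every time.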

Using symlinks

The directory tree may contain symlinks, and they will be followed when -i.links true is set. This potentially causes unwanted duplicates, so we have the following options:

  • -i.dups ignores duplicate inodes, which are guaranteed to be duplicate files - but this cannot work across filesystems.
  • -i.resolve attempts to resolve links before adding files. However this can only work if the link resolution does not point outside of the top-level directory.
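
To see which paths would count as duplicate inodes once links are followed, something like the sketch below can help. It is independent of cbird: same_inode_paths is a hypothetical helper, it assumes file names without newlines, and it compares inode numbers only, which are unique only within one filesystem (the same reason -i.dups cannot work across filesystems).

```shell
# Sketch only: follow symlinks under a directory and print every group
# of paths that resolve to the same inode number.
same_inode_paths() {
    find -L "$1" -type f -exec ls -iL {} + |
        awk '{
                 inode = $1
                 # Strip the leading inode column, keeping the full path.
                 sub(/^[ \t]*[0-9]+[ \t]+/, "")
                 paths[inode] = paths[inode] "\n  " $0
                 count[inode]++
             }
             END {
                 for (i in count)
                     if (count[i] > 1) print "inode " i ":" paths[i]
             }'
}
```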

Limiting the scan

By default, -update scans all files and directories from the top-level directory (given by -use). All supported file types will be considered. This can be limited in several ways:

  • passing a directory after -update will only scan in that directory; index options will only apply to this directory and not to the index as a whole. Cross-index queries like -similar may not work as expected unless the same settings are given.
  • -i.algos chooses which algorithms to compute
  • -i.types chooses which file types (image/video) to consider
  • -i.fsize sets the minimum file size
  • -i.dirs enables recursive scanning of sub-directories

Filters can also be applied on file paths (full relative path from the top-level) using -i.include or -i.exclude, which can be used multiple times.

  • If both include/exclude are specified, the include is considered first, then the exclude.
  • This will not prevent scanning all files/directories; the filter is applied to file paths only

Examples:

  • -i.include "*.jpg" => only add files with .jpg suffix
  • -i.exclude "/some/dir/" => do not add any files with path matching "/some/dir/"
  • -i.include ":(jpg|png|gif)$" -i.exclude "*/originals/*" => add jpg, png, and gif files, except those located in an "originals" folder
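
The include-then-exclude order can be sketched in plain shell. This only illustrates the order of evaluation described above. The helper names are hypothetical, and the matching rules (a leading ":" means regular expression, anything else is a glob that must match the whole path) are assumptions based on the examples on this page, not cbird's actual matcher.

```shell
# path_matches PATTERN PATH
# Assumption: ":" prefix means regex, otherwise a whole-path glob.
path_matches() {
    case "$1" in
        :*) printf '%s\n' "$2" | grep -Eq -- "${1#:}" ;;
        *)  case "$2" in $1) return 0 ;; *) return 1 ;; esac ;;
    esac
}

# keep_path INCLUDE EXCLUDE PATH  (an empty pattern means "not set")
# The include is checked first, then the exclude.
keep_path() {
    if [ -n "$1" ] && ! path_matches "$1" "$3"; then return 1; fi
    if [ -n "$2" ] && path_matches "$2" "$3"; then return 1; fi
    return 0
}
```

For example, keep_path ':(jpg|png|gif)$' '*/originals/*' 'photos/originals/a.jpg' fails, because the path passes the include but is then rejected by the exclude.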

Choosing search algorithms (algos)

Normally, you don't care as they are all enabled by default (see space usage). If you know that particular algos are not going to be useful, you can speed up the scan and save a little space with -i.algos. The fdct, orb, and color algos are much slower to compute and perhaps not useful for very large data sets.

The available algos are:

  • 0 means don't use any algos, cbird will only be able to find exact duplicates with -dups
  • dct finds rescaled and lightly altered images. You should always have this enabled as it is basically free to compute and store
  • fdct finds rescaled and cropped images
  • orb finds rescaled, cropped, and rotated images
  • color finds images with similar color palette; can find mirrored images quickly, useful for sorting/organizing

Examples:

  • -i.algos 0 => fastest option, only -dups will work
  • -i.algos dct => fast indexing but can't find cropped or rotated images
  • -i.algos dct+fdct => slow indexing, fdct can help find cropped images as well, much faster than orb
  • -i.algos dct+orb => slow indexing, orb can find challenging duplicates like rotations

Space usage

As of v0.8, the space requirement for all cbird algorithms (with cache files) is around 30 KB per image, so it is usually of no concern. However, there are a few options available to manage this:

  • -i.algos sets the algos for indexing. The algos in order of heaviest-to-lightest are orb,fdct,video,color,dct. Note that md5 checksums are always enabled.
  • -i.nfeat sets the number of transform-invariant features used per image. These are what allow cbird to find images that are cropped or rotated. This only affects orb and fdct algos. Fewer features will reduce search quality somewhat, but may still be usable.
  • -vacuum can compact the database files if you have made a lot of deletions/removals.

Removing algos

To remove algos you are no longer using, you can remove them from the database with -select-* -remove, and then re-index with -update using the changed algos. Since the heavier algos dominate indexing time, it might make sense to just delete the affected database files and cache directly, as sqlite can be slow:

  • _index/cache/ contains files to speed up -similar, and can be rebuilt from database files
  • _index/media0.db contains file paths, checksums, and dct hash; do not delete this directly
  • _index/media<N>.db contains data for algo N. (1=fdct,2=orb,3=color,4=video). Deleting these drops that particular algo
  • _index/video contains video file indexes. Deleting this effectively drops the video algo
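
The file layout above could be turned into a small helper like the one below. Treat it as a sketch under the assumptions listed on this page (the algo-to-number mapping and the paths under _index); drop_algo_files is a hypothetical name, it deletes data permanently, and it is worth backing up _index before trying it.

```shell
# drop_algo_files TOP_LEVEL_DIR ALGO
# Deletes the database file for one algo plus the cache (which can be
# rebuilt). Refuses anything else, so media0.db is never touched.
drop_algo_files() {
    root=$1
    case "$2" in
        fdct)  n=1 ;;
        orb)   n=2 ;;
        color) n=3 ;;
        video) n=4 ;;
        *) echo "unknown or protected algo: $2" >&2; return 1 ;;
    esac
    rm -f -- "$root/_index/media$n.db"
    # The video algo also keeps per-file indexes in _index/video.
    if [ "$n" -eq 4 ]; then
        rm -rf -- "$root/_index/video"
    fi
    rm -rf -- "$root/_index/cache"
}
```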

Using selections and results

There are two types of lists in cbird, "selections" and "results". A selection (aka "group" or "MediaGroup") is basically a file list. Technically it is a list of "Media" objects, so an item doesn't necessarily have to be a file; it could just be raw image data and a description. For example, -select-grid can cut up an image into separate items for searching.

A result (aka "MediaGroupList") is a two-dimensional list where each item is a selection, and the first item is (by convention) the needle in the search query. Results usually come from search queries but can also be formed by other commands like -group-by.

In cbird, there is always a current selection; by default, it is empty. The selection is built from -select- commands (mostly), which can be combined as needed to get the desired set of files.

  • The selection is referred to in commands that take a "selector" with the special symbol "@"
  • Each -select- command appends to the current selection
  • -select-none clears the current selection

When a command generates a result, or consumes or otherwise invalidates the selection in any way (e.g. -nuke), the current selection is cleared. This prevents ambiguity with commands that can take either a selection or a result (e.g. -show).

Filtering

Filters can be used for a few things, but usually they are used to:

  • limit scope of the search
  • remove or select certain results

Limiting The Scope

Say you know that the originals were taken before a certain date, so you only want to find duplicates of these. You can use -with to select the subset, then search within the subset using "@" to refer to it.

cbird -select-type i -with exif#Photo.DateTimeOriginal#todate '<2022-01-03' -similar-to @

You have a project folder that is known to have valid copies of your assets/originals folder, and you don't want to include it. You can use a regular expression to select the search set, then search within it.

cbird -select-all -without relPath ':^projects/.*' -similar-in @

Removing Unlikely Matches

You have a folder "incoming" with new content. If an incoming file is larger than the existing content, you want to examine it, since it's more likely to be a dupe you want to keep. Note: there is no guarantee that this assumption is correct, but it may suit your needs.

cbird -similar-to ./incoming/ -with fileSize '<=%needle' -show

You can now batch-delete the other ones by inverting the filter:

cbird -similar-to ./incoming/ -without fileSize '<=%needle' -first -select-result -nuke

Combining Filters

If you want to combine filters, there are two boolean options. These are evaluated left to right. The -or-* version (added in v0.7.1) must be preceded by a -with filter.

  • -with [this] -with [that] -- with this AND that
  • -with [this] -or-with [that] -- with this OR that
  • -with [this] -or-with [that] -with [theother] -- with (this OR that) AND theother
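
The left-to-right rule can be modelled as a simple fold over the filter results. The sketch below is only a model of the three cases listed above, not cbird's evaluator: eval_filters is a hypothetical helper, and each filter is replaced by its outcome for one item (1 = passes, 0 = fails).

```shell
# eval_filters OP VALUE [OP VALUE]...
# OP is "with" (AND into the running result) or "or-with" (OR into it).
# Succeeds if the item is kept after all filters are applied in order.
eval_filters() {
    acc=1
    while [ "$#" -ge 2 ]; do
        case "$1" in
            with)    [ "$2" -eq 1 ] || acc=0 ;;
            or-with) [ "$2" -eq 1 ] && acc=1 ;;
            *) return 2 ;;
        esac
        shift 2
    done
    [ "$acc" -eq 1 ]
}
```

So eval_filters with 0 or-with 1 with 1 succeeds, matching (this OR that) AND theother, while eval_filters with 1 or-with 0 with 0 fails.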

Using Property Expressions

A property can be followed by hash (#) and a series of transformations/functions

Lowercasing to Test Strings

  • -with name#lower '~robert paulson'

Date/time conversions

  • -with exif#Photo.DateTimeOriginal#month '<2020-01'

Using Filter Expressions

Boolean tests on properties (v0.7.1)

  • -with name#lower '~milk || ~cookies' == images of only milk, only cookies, or both
  • -with name#lower '~milk && ~cookies' == images with both milk and cookies

Type conversions for metadata properties to enable correct evaluation (v0.7.1)

  • -with exif#Photo.DateTimeOriginal#todate '>=2021-01-01 && <2022-02-01'
  • date is on or after 2021-01-01 and before 2022-02-01

Comparison with the needle property (v0.7.1)

  • -similar-to ./originals -with res '==%needle' -with suffix '==%needle' -with fileSize '<%needle'
  • assume dupes are of lower quality due to smaller size at the same resolution and file type