Home
Welcome to the cbird wiki! Just random stuff for now
cbird indexing is based on a single top-level directory, and stores information using relative paths. The reason is to rule out situations that would break the entire index, which can take quite a while to recompute:
- It is not possible to break the index by renaming/copying/moving the top-level directory
- The index can be shared on the network without causing any problems
However, this design comes with a few issues:
- Paths containing symlinks cannot always be simplified, as they could point outside of the index. This will result in unwanted duplicates.
- `-use` is always needed when working in a sub-directory (the default directory is CWD). This is less annoying with `-use @`, which searches the parent tree for the first index it finds.
- Windows doesn't have a single-root filesystem, which can be a challenge. The `mklink` program can be used to link all content into a top-level directory of your choosing; you can then use `-i.links true`.
The indexing process consists of a few steps:
- Reading options: `-use`, `-i.*` etc
- Verifying: Check state of the database (missing expected files, inconsistent algos, out-of-date items etc)
- Scanning: Find all candidate files for addition. Apply filters, follow links etc
- Deleting: Remove files in the database that were missing from the scan
- Computing: algos compute hashes, features, etc and add them to the index
Options to the indexer are referred to as "index params", which are set with `-i.<name> <value>`. You can get a list of names with `-h` or `-list-index-params`, which will also give the current value and a short description.
There are a couple of gotchas with index params to be aware of:
- Index params are not saved to disk and must be specified every time you run cbird. The best way to work around this is to make a command alias or script to invoke cbird so they remain constant.
- Index params only apply to the `-update` operation.
- Index params apply to the path scanned for changes. This is either the path given to `-use` or the optional path given to `-update`.
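The alias-or-script workaround mentioned above could look something like the following. This is only a sketch: the function name `cbird_update` is hypothetical, the parameter values are placeholders, and it assumes the index params are given before `-update` on the command line.

```shell
# Hypothetical wrapper (not part of cbird): always passes the same
# index params, so every update of a given index uses identical settings.
# Usage: cbird_update <top-level-dir> [optional path for -update]
cbird_update() {
  index_dir=$1
  shift
  # Placeholder values; substitute the params you actually want to pin.
  cbird -use "$index_dir" -i.algos dct+fdct -i.links true -update "$@"
}
```

Putting this in your shell profile means the params only have to be changed in one place.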
The directory tree may contain symlinks, and they will be followed when `-i.links true` is set. This potentially causes unwanted duplicates, so we have the following options:
- `-i.dups` ignores duplicate inodes, which are guaranteed to be duplicate files, but this cannot work across filesystems.
- `-i.resolve` attempts to resolve links before adding files. However, this can only work if the resolved link does not point outside of the top-level directory.
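To see why a shared inode guarantees a duplicate, here is a small experiment that is independent of cbird. It uses the shell's `-ef` test (same device and inode, symlinks followed); the helper name `same_file` and the file names are made up for the demo.

```shell
# Succeeds when both paths refer to the same file (same device and inode).
# `-ef` follows symlinks, so a symlink and its target compare as equal.
same_file() { [ "$1" -ef "$2" ]; }

demo=$(mktemp -d)
echo "pixels" > "$demo/a.jpg"
ln    "$demo/a.jpg" "$demo/hard.jpg"   # hard link: second name, same inode
ln -s "$demo/a.jpg" "$demo/soft.jpg"   # symlink: resolves to the same inode
cp    "$demo/a.jpg" "$demo/copy.jpg"   # real copy: new inode, same content

for f in hard.jpg soft.jpg copy.jpg; do
  if same_file "$demo/a.jpg" "$demo/$f"; then
    echo "$f: same inode as a.jpg"
  else
    echo "$f: different inode"
  fi
done
```

The copy has identical content but a different inode, so an inode check alone would not flag it. Inode numbers are also only unique within one filesystem, which is why this kind of check cannot work across filesystems.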
By default, `-update` scans all files and directories from the top-level directory (given by `-use`). All supported file types will be considered. This can be limited in several ways:
- Passing a directory after `-update` will only scan in that directory; index options will only apply to this directory and not to the index as a whole. Cross-index queries like `-similar` may not work as expected unless the same settings are given.
- `-i.algos` chooses which algorithms to compute
- `-i.types` chooses which file types (image/video) to consider
- `-i.fsize` sets the minimum file size
- `-i.dirs` enables recursive scanning of sub-directories
Filters can also be applied on file paths (full relative path from the top-level) using `-i.include` or `-i.exclude`, which can be used multiple times.
- If both include/exclude are specified, the include is considered first, then the exclude.
- This will not prevent scanning all files/directories; the filter is applied to file paths only.
Examples:
- `-i.include "*.jpg"` => only add files with .jpg suffix
- `-i.exclude "/some/dir/"` => do not add any files with path matching "/some/dir/"
- `-i.include ":(jpg|png|gif)$" -i.exclude "*/originals/*"` => add jpg, png, and gif files, except if they are located in an "originals" folder
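The include-then-exclude order can be modelled with plain shell globs. This is a sketch of the ordering only: `keep_path` is a hypothetical helper, and cbird's own pattern matcher (including its regular expression form) may behave differently from shell `case` patterns.

```shell
# keep_path PATH INCLUDE_GLOB EXCLUDE_GLOB
# Succeeds when PATH passes the include filter and is not excluded.
keep_path() {
  path=$1
  include=$2
  exclude=$3
  # Step 1: the include filter is considered first.
  case "$path" in
    $include) ;;
    *) return 1 ;;
  esac
  # Step 2: the exclude filter can still reject an included path.
  case "$path" in
    $exclude) return 1 ;;
  esac
  return 0
}

for p in photos/a.jpg photos/originals/a.jpg photos/a.png; do
  if keep_path "$p" "*.jpg" "*/originals/*"; then
    echo "add:  $p"
  else
    echo "skip: $p"
  fi
done
```

Only `photos/a.jpg` is added: the second path is included and then excluded, and the third never passes the include.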
Normally you don't need to choose algos, as they are all enabled by default (see space usage). If you know that particular algos are not going to be useful, you can speed up the scan and save a little space with `-i.algos`. The `fdct`, `orb`, and `color` algos are much slower to compute and perhaps not useful for very large data sets.
The available algos are:
- `0` means don't use any algos; cbird will only be able to find exact duplicates with `-dups`
- `dct` finds rescaled and lightly altered images. You should always have this enabled as it is basically free to compute and store
- `fdct` finds rescaled and cropped images
- `orb` finds rescaled, cropped, and rotated images
- `color` finds images with similar color palette; can find mirrored images quickly, useful for sorting/organizing
Examples:
- `-i.algos 0` => fastest option, only `-dups` will work
- `-i.algos dct` => fast indexing but can't find cropped or rotated images
- `-i.algos dct+fdct` => slow indexing, `fdct` can help find cropped images as well, much faster than `orb`
- `-i.algos dct+orb` => slow indexing, `orb` can find challenging duplicates like rotations
As of v0.8, the space requirement for all cbird algorithms (with cache files) is around 30 KB per image. As such, it will usually be of no concern. However, there are a few options available to manage this:
- `-i.algos` sets the algos for indexing. The algos in order of heaviest-to-lightest are `orb`, `fdct`, `video`, `color`, `dct`. Note that md5 checksums are always enabled.
- `-i.nfeat` sets the number of transform-invariant features used per image. These are what allow cbird to find images that are cropped or rotated. This only affects the `orb` and `fdct` algos. Fewer features will reduce search quality somewhat, but may still be usable.
- `-vacuum` can compact the database files if you have made a lot of deletions/removals.
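For a rough idea of what the per-image figure adds up to, the arithmetic can be sketched as below. The helper name `estimate_index_mb` is made up, and the 30 KB default is simply the approximate number quoted above, so treat the result as an order-of-magnitude estimate.

```shell
# estimate_index_mb IMAGE_COUNT [KB_PER_IMAGE]
# Integer estimate of index size in MB, assuming ~30 KB per image.
estimate_index_mb() {
  images=$1
  kb_per_image=${2:-30}
  echo $(( images * kb_per_image / 1024 ))
}

estimate_index_mb 100000   # prints 2929, i.e. roughly 3 GB for 100k images
```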
To drop algos you are no longer using, you can remove the items from the database with `-select-* -remove`, and then re-index them with `-update` and the changed algos. Since the heavier algos dominate indexing time, it might make more sense to delete the affected database files and cache directly, as sqlite can be slow:
- `_index/cache/` contains files to speed up `-similar`, and can be rebuilt from database files
- `_index/media0.db` contains file paths, checksums, and dct hash; do not delete this directly
- `_index/media<N>.db` contains data for algo N (1=fdct, 2=orb, 3=color, 4=video). Deleting these drops that particular algo
- `_index/video` contains video file indexes. Deleting this effectively drops the video algo
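The file layout above could be wrapped in a small helper like the following. This is a sketch under assumptions: `drop_algo` is a hypothetical function, not a cbird command, and it relies only on the file names and numbering listed above. Back up `_index` first and make sure cbird is not running.

```shell
# drop_algo INDEX_DIR ALGO
# Deletes the database file for one algo plus the rebuildable cache.
# Refuses anything else, so media0.db (paths, checksums, dct) is never touched.
drop_algo() {
  index_dir=$1
  case "$2" in
    fdct)  n=1 ;;
    orb)   n=2 ;;
    color) n=3 ;;
    video) n=4 ;;
    *) echo "unknown or protected algo: $2" >&2; return 1 ;;
  esac
  rm -f "$index_dir/media$n.db"
  # The video algo also keeps per-file indexes in a separate directory.
  if [ "$2" = video ]; then
    rm -rf "$index_dir/video"
  fi
  # The cache can be rebuilt from the remaining database files.
  rm -rf "$index_dir/cache"
}
```

For example, `drop_algo ./_index orb` would remove `_index/media2.db` and the cache while leaving the other algos in place.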
There are two types of lists in cbird, "selections" and "results". A selection (aka "group" or "MediaGroup") is basically a file list. Technically it is a list of "Media" objects, so an item doesn't necessarily have to be a file; it could just be raw image data and a description. For example, `-select-grid` can cut up an image into separate items for searching.
A result (aka "MediaGroupList") is a two-dimensional list where each item is a selection, and the first item is (by convention) the needle in the search query. Results usually come from search queries but can also be formed by other commands like `-group-by`.
In cbird, there is always a current selection; by default it is empty. The selection is built (mostly) from `-select-` commands, which can be combined as needed to get the desired set of files.
- The selection is referred to in commands that take a "selector" with the special symbol "@"
- Each `-select-` command appends to the current selection
- `-select-none` clears the current selection
When a command generates a result, or consumes or otherwise invalidates the selection in any way (e.g. `-nuke`), the current selection is cleared. This prevents ambiguity with commands that can take either a selection or a result (e.g. `-show`).
Filters can be used for a few things, but usually they are used to:
- limit scope of the search
- remove or select certain results
Say you know that the originals were taken before a certain date, so you only want to find duplicates of these.
You can use `-with` to select the subset, then search within the subset using "@" to refer to it.
cbird -select-type i -with exif#Photo.DateTimeOriginal#todate '<2022-01-03' -similar-to @
You have a project folder that is known to have valid copies of your assets/originals folder, and you don't want to include it. You can use a regular expression to select the search set, then search within it.
cbird -select-all -without relPath ':^projects/.*' -similar-in @
You have a folder "incoming" with new content. If a file there is larger than the existing content, you want to examine it, since it's more likely to be a dupe you want to keep. Note: there is no guarantee that this assumption is correct, but it may suit your needs.
cbird -similar-to ./incoming/ -with fileSize '<=%needle' -show
You can now batch-delete the other ones by inverting the filter:
cbird -similar-to ./incoming/ -without fileSize '<=%needle' -first -select-result -nuke
If you want to combine filters, there are two boolean options. These are evaluated left to right. The `-or-*` version (added in v0.7.1) must be preceded by a `-with` filter.
- `-with [this] -with [that]` -- with this AND that
- `-with [this] -or-with [that]` -- with this OR that
- `-with [this] -or-with [that] -with [theother]` -- with (this OR that) AND theother
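The left-to-right rule can be checked with a toy evaluator. This only models the boolean combination described above; `eval_filters` is a hypothetical helper that takes already-evaluated filter outcomes (1 = match, 0 = no match) rather than real cbird filters.

```shell
# eval_filters OP VALUE [OP VALUE ...]
# OP is "with" (AND) or "or-with" (OR); evaluated strictly left to right.
eval_filters() {
  acc=1   # an empty filter chain keeps everything
  while [ "$#" -ge 2 ]; do
    case "$1" in
      with)    if [ "$2" -eq 0 ]; then acc=0; fi ;;   # acc = acc AND value
      or-with) if [ "$2" -eq 1 ]; then acc=1; fi ;;   # acc = acc OR value
      *) echo "unknown operator: $1" >&2; return 2 ;;
    esac
    shift 2
  done
  echo "$acc"
}

eval_filters with 0 or-with 1 with 1   # (0 OR 1) AND 1, prints 1
eval_filters with 1 or-with 0 with 0   # (1 OR 0) AND 0, prints 0
```

Because there is no operator precedence, a trailing `-with` always applies to everything accumulated before it.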
A property can be followed by hash (#) and a series of transformations/functions
Lowercasing to Test Strings
-with name#lower '~robert paulson'
Date/time conversions
-with exif#Photo.DateTimeOriginal#month '<2020-01'
Boolean tests on properties (v0.7.1)
- `-with name#lower '~milk || ~cookies'` == images of only milk, only cookies, or both
- `-with name#lower '~milk && ~cookies'` == images with both milk and cookies
Type conversions for metadata properties to enable correct evaluation (v0.7.1)
-with exif#Photo.DateTimeOriginal#todate '>=2021-01-01 && <2021-02-01'
- date is sometime in January 2021
Comparison with the needle property (v0.7.1)
-similar-to ./originals -with res '==%needle' -with suffix '==%needle' -with fileSize '<%needle'
- assume dupes are of lower quality due to smaller size at the same resolution and file type