rgpipe is a single bash/sh script, plus an alias, for use with ripgrep to search through a myriad of file types that are otherwise not grep-friendly. Use it with ripgrep's --pre option, which lets ripgrep selectively preprocess files before searching them.
The most basic usage is to point rgpipe at a file; it will attempt to print the contents of that file to stdout:
rgpipe MyFancyExcelFile.xlsx
The more involved usage is as a filter in front of ripgrep to systematically attempt to grep through the contents of assorted non-text files much as you would text files. The basic incantation looks like:
rg --pre-glob '*.{xlsx,pptx,docx,pdf}' --pre rgpipe "$YourSearchTermHere"
I wrote up an extended gist about how to use it here
That gist is only useful because of the kind note by BurntSushi in this Hacker News comment explaining how rg --pre-glob works.
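As background on what --pre expects: ripgrep runs the given command once per selected file, passes the file path as the first argument, and searches the command's standard output instead of the file itself. The sketch below is a hypothetical, stripped-down preprocessor that follows that contract. The name mini_pre is invented for illustration and this is not rgpipe's actual code; it only handles PDF and falls back to printing the file unchanged.

```shell
#!/bin/sh
# Hypothetical minimal preprocessor, NOT rgpipe itself.
# ripgrep would call it as: <command> /path/to/file
mini_pre() {
    case "$1" in
        *.pdf|*.PDF)
            # pdftotext (from poppler) writes to stdout when the output name is "-"
            pdftotext "$1" -
            ;;
        *)
            # Anything else is passed through untouched
            cat "$1"
            ;;
    esac
}
```

Saved as a standalone script that ends with `mini_pre "$1"`, something like this could be handed to rg --pre in the same way as rgpipe.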
This helps grep through:
- New MS Office files (DOCX, PPTX, XLSX, variants thereof)
  - Uses unzip and sed
- Old MS Office files (DOC, PPT, XLS, variants thereof) & new Excel binary format
  - Uses strings
- LibreOffice files (ODS, ODT, ODP)
  - Uses unzip and sed
- PDF
  - Uses pdftotext from poppler
- Web/structured formats (HTML, XHTML ...)
  - Uses w3m; lynx and friends also work. Not 100% necessary.
- Web formats disguised as books (CHM, EPUB)
  - Uses unzip and w3m for EPUB, 7zip and w3m for CHM
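To illustrate why unzip and sed are enough for the zip-based office formats: these files are zip archives of XML, so the text can be recovered by streaming the XML part out and stripping the tags. This is a rough sketch of the idea, not rgpipe's exact commands. The function names are hypothetical, and it assumes the DOCX body lives in word/document.xml, which is the conventional location.

```shell
# Replace every XML tag with a space, then squeeze runs of spaces.
strip_tags() {
    sed 's/<[^>]*>/ /g' | tr -s ' '
}

# Hypothetical helper: print the text of a DOCX file.
# unzip -p streams one archive member to stdout without extracting to disk.
docx_text() {
    unzip -p "$1" word/document.xml | strip_tags
}
```

The output is crude (no paragraph breaks, XML entities left as-is), but for grepping that is usually good enough.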
Ubuntu wants: sudo apt install poppler-utils p7zip w3m unzip
Termux wants: pkg install poppler p7zip w3m
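A quick way to see which helpers are still missing after installing is to probe for each one with command -v. The function name missing_tools is made up for this sketch, and the binary names in the usage line are assumptions (in particular, the 7zip binary may be called 7z, 7za, or 7zr depending on the package).

```shell
# Print the name of every listed tool that is not found in PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example call (binary names are assumptions, adjust for your system):
# missing_tools unzip sed strings pdftotext w3m 7z
```

No output means everything in the list was found.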
Assuming rgpipe is in your PATH (use /path/to/rgpipe if it's not):
rg --pre rgpipe YourSearchTermHere
The above uses rgpipe even when it's not needed, which is slow. ripgrep can use it selectively with --pre-glob:
rg --pre-glob '*.{xlsx,pptx,docx,pdf}' --pre rgpipe YourSearchTermHere
A more thorough pre-glob:
rg --pre-glob '*.{pdf,xl[tas][bxm],xl[wsrta],do[ct],do[ct][xm],p[po]t[xm],p[op]t,html,htm,xhtm,xhtml,epub,chm,od[stp]}' --pre rgpipe YourSearchTermHere
An alias, because that is a lot of typing:
alias rgg="rg -i -z --max-columns-preview --max-columns 500 --hidden --no-ignore --pre-glob \
'*.{pdf,xl[tas][bxm],xl[wsrta],do[ct],do[ct][xm],p[po]t[xm],p[op]t,html,htm,xhtm,xhtml,epub,chm,od[stp]}' --pre rgpipe"
Step 1: Use rgpipe to make text sidecar files
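find-rgpipe-type() {
    # Write foo.txt next to every foo.<ext> under the current directory
    find "$(pwd)" -type f -iname "*.$1" -exec sh -c 'for f; do rgpipe "$f" > "${f%.*}.txt"; done' _ {} +
}
# or get fancy with xargs to process files in parallel
find-rgpipe-type-xargs() {
    # Each file name is passed as $1 rather than spliced into the script via {},
    # so names containing quotes are safe. This variant writes foo.<ext>.txt
    find "$(pwd)" -type f -iname "*.$1" -print0 | xargs -0 -P0 -n 1 sh -c 'rgpipe "$1" > "$1.txt"' _
}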
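Re-running the functions above regenerates every sidecar, which can be slow on large trees. One possible refinement, sketched here under assumptions, is to convert a file only when its sidecar is missing or older than the source. The helper name sidecar_if_stale and the RGPIPE override variable are hypothetical, and the -nt test is a widely supported extension (bash, dash, busybox) rather than strict POSIX.

```shell
# Hypothetical helper: (re)create foo.txt from foo.<ext> only when needed.
# RGPIPE is an invented override so a different converter can be swapped in.
sidecar_if_stale() {
    src=$1
    out="${src%.*}.txt"
    # Convert when the sidecar does not exist or the source is newer
    if [ ! -e "$out" ] || [ "$src" -nt "$out" ]; then
        "${RGPIPE:-rgpipe}" "$src" > "$out"
    fi
}
```

It takes a single path, so it would need to be called once per file, for example from a loop over find results.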
Make text sidecars for all files with a PDF extension under the current directory, using the find-rgpipe-type function defined above:
find-rgpipe-type pdf
Step 2: Use ripgrep to search those files
rg YourSearchTermHere
2 - The preprocessing script that served as the template, to which I added some more file types
3 - Midnight Commander has great scripts on this subject
5 - rga is a Rust-based tool that does a similar thing
The name is rgpipe because the idea is similar to lesspipe.