Feature Request: Match dissimilar files with filename appendage

Post by **Qvak** » Sat Dec 17, 2022 8:06 am

Request:
The ability to match files by filename while excluding an appendage. Exclusion options could include a) a prefix appendage, b) a suffix appendage, and c) a date and/or time appendage. Excluding dates/times using localization masks may not be useful, since there are no constraints on how the date numbers appear in a filename and the separator characters frequently must be changed.

Matching Attempts / Testing Performed:
I've tried to match a folder tree of PDF files that originated from either scanning or downloading. Most, but not all, of the origin files have been duplicated using PDF software and saved under the same name with a character pattern ('_r') appended. The duplicate may be an exact copy, but it often differs in content and size as a result of compressing and OCRing the file.

Scanning in Regular mode with Ignore content, using only More duplicate options: [all filename combinations] and varied Text matching options, resulted in no usable matches. The results were unremarkable, except that changing the Similar text - Tolerance between 2 and 3 seemed to vary the results the most.

Integration Ideas:
Add to Scan criteria, under More duplicate options, a way to scan by Similar filenames with the ability to exclude differences at a) the beginning of the filename or b) the end of the filename. Alternatively, add a Character pattern checkbox to the Text matching options.
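To illustrate the kind of matching I mean, here is a minimal Python sketch. It is not how Duplicate Cleaner works internally; the patterns (PREFIX_RE, SUFFIX_RE, DATE_RE) and the function names are my own assumptions, and the date pattern only covers year-month-day forms at the end of the name, so it could also strip an unrelated trailing 8-digit number.

```python
import re
from collections import defaultdict
from pathlib import PurePath

# Hypothetical appendage patterns; adjust to whatever is actually in use.
PREFIX_RE = re.compile(r"^copy of ", re.IGNORECASE)          # leading 'Copy of '
SUFFIX_RE = re.compile(r"_r$")                               # trailing '_r'
DATE_RE = re.compile(r"[ _-]?\d{4}[-_.]?\d{2}[-_.]?\d{2}$")  # trailing date, e.g. 2022-12-17


def match_key(filename, strip_patterns=(PREFIX_RE, SUFFIX_RE, DATE_RE)):
    """Return the lowercased stem with every configured appendage removed."""
    stem = PurePath(filename).stem
    changed = True
    while changed:  # repeat so stacked appendages ('_2022-12-17_r') are all removed
        changed = False
        for pattern in strip_patterns:
            new_stem = pattern.sub("", stem)
            if new_stem != stem:
                stem, changed = new_stem, True
    return stem.lower()


def group_by_key(filenames):
    """Group filenames whose keys match; keep only groups with more than one file."""
    groups = defaultdict(list)
    for name in filenames:
        groups[match_key(name)].append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}
```

Under these assumptions, 'Statement.pdf', 'Statement_r.pdf', and 'Statement_2022-12-17.pdf' would all land in one group, which is the behavior I could not get from the Similar text tolerance setting.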

Use Cases:
1) If a user is working on a file in an uncontrolled environment, they may save ongoing revisions by adding to the filename a date, time, user, or some brief explanation of the change. This feature could allow the user to find duplicates of all origin documents in order to a) delete unnecessary or confusing duplicates across the Scan locations, or b) ensure that an 'active' document doesn't exist erroneously on multiple branches. This is especially useful in a family or SOHO environment where a single cloud/desktop storage account is used, such as Dropbox, and one or more users don't understand file systems or filing policies, or may have unknowingly moved or copied a folder branch to a different location.

2) I typically scan or download documents as PDF and place them in a single folder tree, waiting to be optimized and OCRed using a batch process ('Action Wizard' feature) in my PDF reader/writer software. This is performed every couple of weeks or every month, and can include things like downloaded bank statements, scanned receipts, medical records, and most other things in life. Optimizing and OCRing the PDFs results in different file sizes and contents. For file integrity I set the process to produce a new file with the same name but appended with '_r'. When the action is complete I delete the origin files. However, there are usually some files that didn't process, for example because I forgot to unlock a PDF after downloading it in Windows or had included an image file that would not be processed. Since this is a batch process, referring to log files isn't time effective, and there is no effective way to sort and filter the files using the filename or metadata.
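For this second use case, the check I am after can be sketched roughly as follows in Python. The function name find_unprocessed is hypothetical, and the sketch assumes the processed copy sits in the same folder as its origin file and always has a lowercase '.pdf' extension; on a case-sensitive file system a copy saved as '.PDF' would not be detected.

```python
from pathlib import Path

SUFFIX = "_r"  # the appendage described above


def find_unprocessed(root, suffix=SUFFIX):
    """Return files under root that have no '<stem><suffix>.pdf' sibling."""
    unprocessed = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        if path.suffix.lower() == ".pdf" and path.stem.endswith(suffix):
            continue  # already a processed copy, nothing to check
        # Assumption: the processed copy lives next to the origin file.
        processed = path.with_name(path.stem + suffix + ".pdf")
        if not processed.exists():
            unprocessed.append(path)
    return unprocessed
```

Anything this returns is an origin file (or a stray image) with no '_r' counterpart, so it is not safe to delete yet.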

Duplicate Cleaner Pro is probably the simplest, most helpful, most useful, and most effective utility I've ever used. I hope you can find benefit in my request.

(Due to vision limitations, I may not respond to replies or may reply very late - Sorry!)