Hi All,

I recently tried out the KDE/Plasma search (Baloo).

  1. Indexing full content was too slow (I have some 100GB of data), and I disabled it.

  2. Indexing filenames only was reasonably quick.

  3. The search was very restrictive (full words only, miscategorized files). To make it usable for me, I had to get a list of all files and dump it to fzf, which worked reasonably well.

  4. Using baloosearch6 to get a long list of files provides almost no noticeable performance improvement over fd:

     > time ( baloosearch6 mimetype:application/pdf | wc -l )
     0.05s user 0.03s system 111% cpu 0.069 total
    
    
     > time ( \fd -H --no-ignore-vcs --xdev -tf -tl '.pdf$' | wc -l ) 
     0.24s user 0.15s system 364% cpu 0.107 total
    

    (Both commands found about 11,000 files. I’m using an SSD with about 500 MB/s read speed.)

  5. If I try it again with a larger file set:

     > time ( baloosearch6 -d VSync/ '' | wc -l ) 
     0.23s user 0.10s system 123% cpu 0.264 total
    
     > time ( \fd -H --no-ignore-vcs --xdev -tf -tl --base-directory=VSync/ | wc -l )
     0.13s user 0.11s system 456% cpu 0.052 total
    

    This time Baloo found 96,000 files, and fd found 59,000 files. (fd might have run faster because of disk caching.)
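The workaround from item 3 (list everything, then fuzzy-search the list) can be sketched roughly as below. The function name `baloo_pick` is made up for illustration; the `baloosearch6 -d DIR ''` call is copied from item 5, and the sketch assumes fzf is installed and reads its candidates from stdin.

```shell
# Hypothetical helper: list every file Baloo has indexed under a directory
# (default: $HOME) and hand the list to fzf for interactive fuzzy matching.
# The empty query '' mirrors the invocation used in item 5 above.
baloo_pick() {
    baloosearch6 -d "${1:-$HOME}" '' | fzf
}
```

Running `baloo_pick VSync/` would then open fzf on the same list that item 5 counted.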

fd used more CPU, no doubt. But the difference in wall time is so small that using an indexed search no longer makes sense to me.

Any thoughts?

  • just_another_person@lemmy.world (4 upvotes) · 21 days ago

    Metadata and context are the difference.

    With fd you literally only grab a list of file paths. It’s not looking into the content of anything, and by default it skips things like files matched by .gitignore. See for yourself.

    You’re comparing apples and oranges here.

    If you have 100GB, the question is more about what you want it to scan, and why. If you don’t need to know where media files are, exclude those directories. Same with git repos and such.
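As a rough sketch of excluding directories from the index: the helper below assumes Plasma 6's `balooctl6` and its `config add excludeFolders` subcommand, and assumes the folders are given as absolute paths. The name `baloo_exclude` and the example directories are hypothetical.

```shell
# Hypothetical helper: add each given directory to Baloo's exclude list.
# Assumes `balooctl6 config add excludeFolders <path>` is available;
# stops at the first directory that balooctl6 rejects.
baloo_exclude() {
    for dir in "$@"; do
        balooctl6 config add excludeFolders "$dir" || return 1
    done
}

# Example (paths are placeholders):
#   baloo_exclude "$HOME/Music" "$HOME/src"
```

The current list can be checked afterwards with `balooctl6 config list excludeFolders`, assuming that subcommand behaves the same on your version.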

    • gi1242@lemmy.world (OP) (1 upvote) · 21 days ago

      My images etc. are on a separate partition (300GB, not indexed). I certainly have tonnes of data in .git folders, which fd ignores. But the exclude_folders setting in baloofilerc seems to ignore most of these by default.

      I agree metadata and context make a huge difference. Looking at my workflow, I’ve put all the data I need into the file names 😄. The metadata is borked for most of them because many were downloaded some 20+ years ago. So I put the author names and title into the filename to make it easy to search…

      Unfortunately, Baloo ignores the full path. There’s a main.* file in several folders, but Baloo search ignores the parent folder name, so I dump all Baloo results to fzf and search there…
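For a non-interactive version of the same idea, a small filter can match terms against the whole path instead of only the file name. This is a minimal sketch; `pathgrep` is a made-up name, and it only relies on standard `grep -iF`.

```shell
# Hypothetical helper: read paths on stdin and keep only those that contain
# every given term somewhere in the full path (case-insensitive, literal
# strings, any order). With no terms, all paths pass through unchanged.
pathgrep() {
    if [ "$#" -eq 0 ]; then
        cat
        return
    fi
    term=$1
    shift
    # Filter on the first term, then recurse on the remaining ones.
    grep -iF -e "$term" | pathgrep "$@"
}
```

For example, `baloosearch6 -d VSync/ '' | pathgrep thesis main` would keep a main.* file only when "thesis" also appears somewhere in its path ("thesis" being a placeholder folder name).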