From: jean.boussier@...
Date: 2020-06-30T19:57:25+00:00
Subject: [ruby-core:98999] [Ruby master Feature#17001] [Feature] Dir.scan to yield dirent for efficient and composable recursive directory scanning

Issue #17001 has been reported by byroot (Jean Boussier).

----------------------------------------
Feature #17001: [Feature] Dir.scan to yield dirent for efficient and composable recursive directory scanning
https://bugs.ruby-lang.org/issues/17001

* Author: byroot (Jean Boussier)
* Status: Open
* Priority: Normal

----------------------------------------
### Use case

When you need to recursively scan a directory, you have two options.

You can use `Dir[]` / `Dir.glob`, which is fine for small directories or simple patterns, but can easily take several seconds to complete for large repositories or complex patterns, and returns a very large array which tends to thrash the GC.

Or you can use `Dir.each_child` / `Dir.foreach` recursively, but then you need to `stat` each entry to know whether it's a directory, or even a symlink if you want to follow them. This means one syscall per directory, plus one more for every file and directory it contains. This is particularly impactful on OSX, where `stat()` is several times slower than on Linux because of various sandboxing features.

There's a [typical example of this use case in Bootsnap](https://github.com/Shopify/bootsnap/blob/56c61373000573112ee027dae4be19aecd50e46e/lib/bootsnap/load_path_cache/path_scanner.rb).

### Proposal

[Python introduced `os.scandir` a few years ago](https://www.python.org/dev/peps/pep-0471/) for exactly this purpose. It is functionally similar to `Dir.foreach` / `Dir.each_child`, except that it yields `DirEntry` instances, which are a wrapper around the `libc` `dirent` struct.

I reduced the Bootsnap code into a [simplified benchmark](https://gist.github.com/casperisfine/2124f349c6564560df4399f2eadaa8f2). Using `os.scandir()`, Python scans our main repo in a bit over `1s`, which is 3 to 4 times faster than Ruby manages with `Dir.foreach` (`3-4s`).
For comparison's sake, `Dir['**/*.rb']` also completes in about `1s`.

So I believe that exposing a similar `Dir.scan` method, returning `Dir::Entry` instances with methods inspired by `File::Stat` such as `directory?`, would allow for more performant file system scanning when the query is not easily expressed with a glob pattern.
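
To make the shape of the proposal more concrete, here is a rough pure-Ruby sketch of how such an API might be used. Everything named here (`ScanEntry`, `scan_dir`, `find_files`) is hypothetical and only stands in for the proposed `Dir::Entry` and `Dir.scan`. In particular, this emulation still has to call `File.lstat` for every entry, which is exactly the cost a native implementation reading the type from `dirent` would be expected to avoid on filesystems that report it.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical stand-in for the proposed Dir::Entry. This is NOT a Ruby API.
# A native version could take the type from the dirent struct; here it is
# filled in from File.lstat, i.e. one extra syscall per entry.
ScanEntry = Struct.new(:name, :path, :type) do
  def directory?
    type == :directory
  end

  def file?
    type == :file
  end

  def symlink?
    type == :link
  end
end

# Hypothetical emulation of the proposed Dir.scan: yields one ScanEntry
# per child of +dir+ ("." and ".." are skipped by Dir.each_child).
def scan_dir(dir)
  Dir.each_child(dir) do |name|
    path = File.join(dir, name)
    type = File.lstat(path).ftype.to_sym # "file", "directory", "link", ...
    yield ScanEntry.new(name, path, type)
  end
end

# Recursive walk built on top of scan_dir: collects regular files whose
# name ends with +ext+. Symlinks are not followed.
def find_files(root, ext)
  found = []
  scan_dir(root) do |entry|
    if entry.directory?
      found.concat(find_files(entry.path, ext))
    elsif entry.file? && entry.name.end_with?(ext)
      found << entry.path
    end
  end
  found
end

# Small demo on a throwaway directory tree.
Dir.mktmpdir do |root|
  FileUtils.mkdir_p(File.join(root, "lib", "nested"))
  FileUtils.touch(File.join(root, "lib", "a.rb"))
  FileUtils.touch(File.join(root, "lib", "nested", "b.rb"))
  FileUtils.touch(File.join(root, "README.md"))
  puts find_files(root, ".rb").sort
end
```

The point of the sketch is the calling code in `find_files`: the traversal logic only asks the entry for its type, so a filter that is awkward to express as a glob pattern stays plain Ruby, and the per-entry `lstat` could disappear without changing the caller.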