summaryrefslogtreecommitdiff
path: root/dir.c
AgeCommit message (Collapse)AuthorFilesLines
2020-01-16dir: point treat_leading_path() warning to the right placeLibravatar Jeff King1-3/+3
Commit 777b420347 (dir: synchronize treat_leading_path() and read_directory_recursive(), 2019-12-19) tried to add two warning comments in those functions, pointing at each other. But the one in treat_leading_path() just points at itself. Let's fix that. Since the comment also redirects the reader for more details to "the commit that added this warning", and since we're now modifying the warning (creating a new commit without those details), let's mention the actual commit id. Signed-off-by: Jeff King <peff@peff.net> Reviewed-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-16dir: restructure in a way to avoid passing around a struct direntLibravatar Jeff King1-42/+31
Restructure the code slightly to avoid passing around a struct dirent anywhere, which also enables us to avoid trying to manufacture one. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2020-01-16dir: treat_leading_path() and read_directory_recursive(), round 2Libravatar Elijah Newren1-0/+4
I was going to title this "dir: more synchronizing of treat_leading_path() and read_directory_recursive()", a nod to commit 777b42034764 ("dir: synchronize treat_leading_path() and read_directory_recursive()", 2019-12-19), but the title was too long. Anyway, first the backstory... fill_directory() has always had a slightly error-prone interface: it returns a subset of paths which *might* match the specified pathspec; it was intended to prune away some paths which didn't match the specified pathspec and keep at least all the ones that did match it. Given this interface, callers were responsible to post-process the results and check whether each actually matched the pathspec. builtin/clean.c did this. It would first prune out duplicates (e.g. if "dir" was returned as well as all files under "dir/", then it would simplify this to just "dir"), and after pruning duplicates it would compare the remaining paths to the specified pathspec(s). This post-processing itself could run into problems, though, as noted in commit 404ebceda01c ("dir: also check directories for matching pathspecs", 2019-09-17): For the case of git-clean and a set of pathspecs of "dir/file" and "more", this caused a problem because we'd end up with dir entries for both of "dir" "dir/file" Then correct_untracked_entries() would try to helpfully prune duplicates for us by removing "dir/file" since it's under "dir", leaving us with "dir" Since the original pathspec only had "dir/file", the only entry left doesn't match and leaves nothing to be removed. (Note that if only one pathspec was specified, e.g. only "dir/file", then the common_prefix_len optimizations in fill_directory would cause us to bypass this problem, making it appear in simple tests that we could correctly remove manually specified pathspecs.) That commit fixed the issue -- when multiple pathspecs were specified -- by making sure fill_directory() wouldn't return both "dir" and "dir/file" outside the common_prefix_len optimization path. This is where it starts to get fun. In commit b9670c1f5e6b ("dir: fix checks on common prefix directory", 2019-12-19), we noticed that the common_prefix_len wasn't doing appropriate checks and letting all kinds of stuff through, resulting in recursing into .git/ directories and other craziness. So it started locking down and doing checks on pathnames within that code path. That continued with commit 777b42034764 ("dir: synchronize treat_leading_path() and read_directory_recursive()", 2019-12-19), which noted the following: Our optimization to avoid calling into read_directory_recursive() when all pathspecs have a common leading directory mean that we need to match the logic that read_directory_recursive() would use if we had just called it from the root. Since it does more than call treat_path() we need to copy that same logic. ...and then it more forcefully addressed the issue with this wonderfully ironic statement: Needing to duplicate logic like this means it is guaranteed someone will eventually need to make further changes and forget to update both locations. It is tempting to just nuke the leading_directory special casing to avoid such bugs and simplify the code, but unpack_trees' verify_clean_subdirectory() also calls read_directory() and does so with a non-empty leading path, so I'm hesitant to try to restructure further. Add obnoxious warnings to treat_leading_path() and read_directory_recursive() to try to warn people of such problems. You would think that with such a strongly worded description, that its author would have actually ensured that the logic in treat_leading_path() and read_directory_recursive() did actually match and that *everything* that was needed had at least been copied over at the time that this paragraph was written. But you'd be wrong, I messed it up by missing part of the logic. Copy the missing bits to fix the new final test in t7300. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-12-25Merge branch 'en/fill-directory-fixes'Libravatar Junio C Hamano1-49/+138
Assorted fixes to the directory traversal API. * en/fill-directory-fixes: dir.c: use st_add3() for allocation size dir: consolidate similar code in treat_directory() dir: synchronize treat_leading_path() and read_directory_recursive() dir: fix checks on common prefix directory dir: break part of read_directory_recursive() out for reuse dir: exit before wildcard fall-through if there is no wildcard dir: remove stray quote character in comment Revert "dir.c: make 'git-status --ignored' work within leading directories" t3011: demonstrate directory traversal failures
2019-12-25Merge branch 'ds/sparse-cone'Libravatar Junio C Hamano1-8/+208
Management of sparsely checked-out working tree has gained a dedicated "sparse-checkout" command. * ds/sparse-cone: (21 commits) sparse-checkout: improve OS ls compatibility sparse-checkout: respect core.ignoreCase in cone mode sparse-checkout: check for dirty status sparse-checkout: update working directory in-process for 'init' sparse-checkout: cone mode should not interact with .gitignore sparse-checkout: write using lockfile sparse-checkout: use in-process update for disable subcommand sparse-checkout: update working directory in-process sparse-checkout: sanitize for nested folders unpack-trees: add progress to clear_ce_flags() unpack-trees: hash less in cone mode sparse-checkout: init and set in cone mode sparse-checkout: use hashmaps for cone patterns sparse-checkout: add 'cone' mode trace2: add region in clear_ce_flags sparse-checkout: create 'disable' subcommand sparse-checkout: add '--stdin' option to set subcommand sparse-checkout: 'set' subcommand clone: add --sparse mode sparse-checkout: create 'init' subcommand ...
2019-12-20dir.c: use st_add3() for allocation sizeLibravatar Junio C Hamano1-1/+1
When preparing a manufactured dirent instance, we add a length of path to the size of struct to decide how many bytes to allocate. Make sure this addition does not wrap-around to cause us underallocate. Suggested-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-12-19dir: consolidate similar code in treat_directory()Libravatar Elijah Newren1-11/+7
Both the DIR_SKIP_NESTED_GIT and DIR_NO_GITLINKS cases were checking for whether a path was actually a nonbare repository. That code could be shared, with just the result of how to act differing between the two cases. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-12-19dir: synchronize treat_leading_path() and read_directory_recursive()Libravatar Elijah Newren1-0/+30
Our optimization to avoid calling into read_directory_recursive() when all pathspecs have a common leading directory mean that we need to match the logic that read_directory_recursive() would use if we had just called it from the root. Since it does more than call treat_path() we need to copy that same logic. Alternatively, we could try to change treat_path to return path_recurse for an untracked directory under the given special circumstances that this logic checks for, but a simple switch results in many test failures such as 'git clean -d' not wiping out untracked but empty directories. To work around that, we'd need the caller of treat_path to check for path_recurse and sometimes special case it into path_untracked. In other words, we'd still have extra logic in both places. Needing to duplicate logic like this means it is guaranteed someone will eventually need to make further changes and forget to update both locations. It is tempting to just nuke the leading_directory special casing to avoid such bugs and simplify the code, but unpack_trees' verify_clean_subdirectory() also calls read_directory() and does so with a non-empty leading path, so I'm hesitant to try to restructure further. Add obnoxious warnings to treat_leading_path() and read_directory_recursive() to try to warn people of such problems. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-12-19dir: fix checks on common prefix directoryLibravatar Elijah Newren1-11/+56
Many years ago, the directory traversing logic had an optimization that would always recurse into any directory that was a common prefix of all the pathspecs without walking the leading directories to get down to the desired directory. Thus, git ls-files -o .git/ # case A would notice that .git/ was a common prefix of all pathspecs (since it is the only pathspec listed), and then traverse into it and start showing unknown files under that directory. Unfortunately, .git/ is not a directory we should be traversing into, which made this optimization problematic. This also affected cases like git ls-files -o --exclude-standard t/ # case B where t/ was in the .gitignore file and thus isn't interesting and shouldn't be recursed into. It also affected cases like git ls-files -o --directory untracked_dir/ # case C where untracked_dir/ is indeed untracked and thus interesting, but the --directory flag means we only want to show the directory itself, not recurse into it and start listing untracked files below it. The case B class of bugs were noted and fixed in commits 16e2cfa90993 ("read_directory(): further split treat_path()", 2010-01-08) and 48ffef966c76 ("ls-files: fix overeager pathspec optimization", 2010-01-08), with the idea being that we first wanted to check whether the common prefix was interesting. The former patch noted that treat_path() couldn't be used when checking the common prefix because treat_path() requires a dir_entry() and we haven't read any directories at the point we are checking the common prefix. So, that patch split treat_one_path() out of treat_path(). The latter patch then created a new treat_leading_path() which duplicated by hand the bits of treat_path() that couldn't be broken out and then called treat_one_path() for the remainder. There were three problems with this approach: * The duplicated logic in treat_leading_path() accidentally missed the check for special paths (such as is_dot_or_dotdot and matching ".git"), causing case A types of bugs to continue to be an issue. * The treat_leading_path() logic assumed we should traverse into anything where path_treatment was not path_none, i.e. it perpetuated class C types of bugs. * It meant we had split logic that needed to kept in sync, running the risk that people introduced new inconsistencies (such as in commit be8a84c52669, which we reverted earlier in this series, or in commit df5bcdf83ae which we'll fix in a subsequent commit) Fix most these problems by making treat_leading_path() not only loop over each leading path component, but calling treat_path() directly on each. To do so, we have to create a synthetic dir_entry, but that only takes a few lines. Then, pay attention to the path_treatment result we get from treat_path() and don't treat path_excluded, path_untracked, and path_recurse all the same as path_recurse. This leaves one remaining problem, the new inconsistency from commit df5bcdf83ae. That will be addressed in a subsequent commit. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-12-16Merge branch 'hw/doc-in-header'Libravatar Junio C Hamano1-2/+0
* hw/doc-in-header: trace2: move doc to trace2.h submodule-config: move doc to submodule-config.h tree-walk: move doc to tree-walk.h trace: move doc to trace.h run-command: move doc to run-command.h parse-options: add link to doc file in parse-options.h credential: move doc to credential.h argv-array: move doc to argv-array.h cache: move doc to cache.h sigchain: move doc to sigchain.h pathspec: move doc to pathspec.h revision: move doc to revision.h attr: move doc to attr.h refs: move doc to refs.h remote: move doc to remote.h and refspec.h sha1-array: move doc to sha1-array.h merge: move doc to ll-merge.h graph: move doc to graph.h and graph.c dir: move doc to dir.h diff: move doc to diff.h and diffcore.h
2019-12-13sparse-checkout: respect core.ignoreCase in cone modeLibravatar Derrick Stolee1-3/+12
When a user uses the sparse-checkout feature in cone mode, they add patterns using "git sparse-checkout set <dir1> <dir2> ..." or by using "--stdin" to provide the directories line-by-line over stdin. This behaviour naturally looks a lot like the way a user would type "git add <dir1> <dir2> ..." If core.ignoreCase is enabled, then "git add" will match the input using a case-insensitive match. Do the same for the sparse-checkout feature. Perform case-insensitive checks while updating the skip-worktree bits during unpack_trees(). This is done by changing the hash algorithm and hashmap comparison methods to optionally use case- insensitive methods. When this is enabled, there is a small performance cost in the hashing algorithm. To tease out the worst possible case, the following was run on a repo with a deep directory structure: git ls-tree -d -r --name-only HEAD | git sparse-checkout set --stdin The 'set' command was timed with core.ignoreCase disabled or enabled. For the repo with a deep history, the numbers were core.ignoreCase=false: 62s core.ignoreCase=true: 74s (+19.3%) For reproducibility, the equivalent test on the Linux kernel repository had these numbers: core.ignoreCase=false: 3.1s core.ignoreCase=true: 3.6s (+16%) Now, this is not an entirely fair comparison, as most users will define their sparse cone using more shallow directories, and the performance improvement from eb42feca97 ("unpack-trees: hash less in cone mode" 2019-11-21) can remove most of the hash cost. For a more realistic test, drop the "-r" from the ls-tree command to store only the first-level directories. In that case, the Linux kernel repository takes 0.2-0.25s in each case, and the deep repository takes one second, plus or minus 0.05s, in each case. Thus, we _can_ demonstrate a cost to this change, but it is unlikely to matter to any reasonable sparse-checkout cone. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-12-11dir: break part of read_directory_recursive() out for reuseLibravatar Elijah Newren1-23/+37
Create an add_path_to_appropriate_result_list() function from the code at the end of read_directory_recursive() so we can use it elsewhere. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-12-11dir: exit before wildcard fall-through if there is no wildcardLibravatar Elijah Newren1-0/+7
The DO_MATCH_LEADING_PATHSPEC had a fall-through case for if there was a wildcard, noting that we don't yet have enough information to determine if a further paths under the current directory might match due to the presence of wildcards. But if we have no wildcards in our pathspec, then we shouldn't get to that fall-through case. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-12-11dir: remove stray quote character in commentLibravatar Elijah Newren1-1/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-12-11Revert "dir.c: make 'git-status --ignored' work within leading directories"Libravatar Elijah Newren1-3/+0
Commit be8a84c52669 ("dir.c: make 'git-status --ignored' work within leading directories", 2013-04-15) noted that git status --ignored <SOMEPATH> would not list ignored files and directories within <SOMEPATH> if <SOMEPATH> was untracked, and modified the behavior to make it show them. However, it did so via a hack that broke consistency; it would show paths under <SOMEPATH> differently than a simple git status --ignored | grep <SOMEPATH> would show them. A correct fix is slightly more involved, and complicated slightly by this hack, so we revert this commit (but keep corrected versions of the testcases) and will later fix the original bug with a subsequent patch. Some history may be helpful: A very, very similar case to the commit we are reverting was raised in commit 48ffef966c76 ("ls-files: fix overeager pathspec optimization", 2010-01-08); but it actually went in somewhat the opposite direction. In that commit, it mentioned how git ls-files -o --exclude-standard t/ used to show untracked files under t/ even when t/ was ignored, and then changed the behavior to stop showing untracked files under an ignored directory. More importantly, this commit considered keeping this behavior but noted that it would be inconsistent with the behavior when multiple pathspecs were specified and thus rejected it. The reason for this whole inconsistency when one pathspec is specified versus zero or two is because common prefixes of pathspecs are sent through a different set of checks (in treat_leading_path()) than normal file/directory traversal (those go through read_directory_recursive() and treat_path()). As such, for consistency, one needs to check that both codepaths produce the same result. Revert commit be8a84c526691667fc04a8241d93a3de1de298ab, except instead of removing the testcase it added, modify it to check for correct and consistent behavior. A subsequent patch in this series will fix the testcase. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-11-22unpack-trees: hash less in cone modeLibravatar Derrick Stolee1-2/+2
The sparse-checkout feature in "cone mode" can use the fact that the recursive patterns are "connected" to the root via parent patterns to decide if a directory is entirely contained in the sparse-checkout or entirely removed. In these cases, we can skip hashing the paths within those directories and simply set the skipworktree bit to the correct value. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-11-22sparse-checkout: init and set in cone modeLibravatar Derrick Stolee1-4/+4
To make the cone pattern set easy to use, update the behavior of 'git sparse-checkout (init|set)'. Add '--cone' flag to 'git sparse-checkout init' to set the config option 'core.sparseCheckoutCone=true'. When running 'git sparse-checkout set' in cone mode, a user only needs to supply a list of recursive folder matches. Git will automatically add the necessary parent matches for the leading directories. When testing 'git sparse-checkout set' in cone mode, check the error stream to ensure we do not see any errors. Specifically, we want to avoid the warning that the patterns do not match the cone-mode patterns. Helped-by: Eric Wong <e@80x24.org> Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-11-22sparse-checkout: use hashmaps for cone patternsLibravatar Derrick Stolee1-8/+199
The parent and recursive patterns allowed by the "cone mode" option in sparse-checkout are restrictive enough that we can avoid using the regex parsing. Everything is based on prefix matches, so we can use hashsets to store the prefixes from the sparse-checkout file. When checking a path, we can strip path entries from the path and check the hashset for an exact match. As a test, I created a cone-mode sparse-checkout file for the Linux repository that actually includes every file. This was constructed by taking every folder in the Linux repo and creating the pattern pairs here: /$folder/ !/$folder/*/ This resulted in a sparse-checkout file sith 8,296 patterns. Running 'git read-tree -mu HEAD' on this file had the following performance: core.sparseCheckout=false: 0.21 s (0.00 s) core.sparseCheckout=true: 3.75 s (3.50 s) core.sparseCheckoutCone=true: 0.23 s (0.01 s) The times in parentheses above correspond to the time spent in the first clear_ce_flags() call, according to the trace2 performance traces. While this example is contrived, it demonstrates how these patterns can slow the sparse-checkout feature. Helped-by: Eric Wong <e@80x24.org> Helped-by: Johannes Schindelin <Johannes.Schindelin@gmx.de> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-11-18dir: move doc to dir.hLibravatar Heba Waly1-2/+0
Move the documentation from Documentation/technical/api-directory-listing.txt to dir.h as it's easier for the developers to find the usage information beside the code instead of looking for it in another doc file. Also documentation/technical/api-directory-listing.txt is removed because the information it has is now redundant and it'll be hard to keep it up to date and synchronized with the documentation in the header files. Signed-off-by: Heba Waly <heba.waly@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-11-10Fix spelling errors in code commentsLibravatar Elijah Newren1-1/+1
Reported-by: Jens Schleusener <Jens.Schleusener@fossies.org> Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-10-11Merge branch 'en/clean-nested-with-ignored'Libravatar Junio C Hamano1-17/+48
"git clean" fixes. * en/clean-nested-with-ignored: dir: special case check for the possibility that pathspec is NULL clean: fix theoretical path corruption clean: rewrap overly long line clean: avoid removing untracked files in a nested git repository clean: disambiguate the definition of -d git-clean.txt: do not claim we will delete files with -n/--dry-run dir: add commentary explaining match_pathspec_item's return value dir: if our pathspec might match files under a dir, recurse into it dir: make the DO_MATCH_SUBMODULE code reusable for a non-submodule case dir: also check directories for matching pathspecs dir: fix off-by-one error in match_pathspec_item dir: fix typo in comment t7300: add testcases showing failure to clean specified pathspecs
2019-10-02dir: special case check for the possibility that pathspec is NULLLibravatar Elijah Newren1-3/+5
Commits 404ebceda01c ("dir: also check directories for matching pathspecs", 2019-09-17) and 89a1f4aaf765 ("dir: if our pathspec might match files under a dir, recurse into it", 2019-09-17) added calls to match_pathspec() and do_match_pathspec() passing along their pathspec parameter. Both match_pathspec() and do_match_pathspec() assume the pathspec argument they are given is non-NULL. It turns out that unpack-tree.c's verify_clean_subdirectory() calls read_directory() with pathspec == NULL, and it is possible on case insensitive filesystems for that NULL to make it to these new calls to match_pathspec() and do_match_pathspec(). Add appropriate checks on the NULLness of pathspec to avoid a segfault. In case the negation throws anyone off (one of the calls was to do_match_pathspec() while the other was to !match_pathspec(), yet no negation of the NULLness of pathspec is used), there are two ways to understand the differences: * The code already handled the pathspec == NULL cases before this series, and this series only tried to change behavior when there was a pathspec, thus we only want to go into the if-block if pathspec is non-NULL. * One of the calls is for whether to recurse into a subdirectory, the other is for after we've recursed into it for whether we want to remove the subdirectory itself (i.e. the subdirectory didn't match but something under it could have). That difference in situation leads to the slight differences in logic used (well, that and the slightly unusual fact that we don't want empty pathspecs to remove untracked directories by default). Denton found and analyzed one issue and provided the patch for the match_pathspec() call, SZEDER figured out why the issue only reproduced for some folks and not others and provided the testcase, and I looked through the remainder of the series and noted the do_match_pathspec() call that should have the same check. Co-authored-by: Denton Liu <liu.denton@gmail.com> Co-authored-by: SZEDER Gábor <szeder.dev@gmail.com> Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-30Merge branch 'ds/include-exclude'Libravatar Junio C Hamano1-136/+148
The internal code originally invented for ".gitignore" processing got reshuffled and renamed to make it less tied to "excluding" and stress more that it is about "matching", as it has been reused for things like sparse checkout specification that want to check if a path is "included". * ds/include-exclude: unpack-trees: rename 'is_excluded_from_list()' treewide: rename 'exclude' methods to 'pattern' treewide: rename 'EXCL_FLAG_' to 'PATTERN_FLAG_' treewide: rename 'struct exclude_list' to 'struct pattern_list' treewide: rename 'struct exclude' to 'struct path_pattern'
2019-09-17clean: avoid removing untracked files in a nested git repositoryLibravatar Elijah Newren1-0/+10
Users expect files in a nested git repository to be left alone unless sufficiently forced (with two -f's). Unfortunately, in certain circumstances, git would delete both tracked (and possibly dirty) files and untracked files within a nested repository. To explain how this happens, let's contrast a couple cases. First, take the following example setup (which assumes we are already within a git repo): git init nested cd nested >tracked git add tracked git commit -m init >untracked cd .. In this setup, everything works as expected; running 'git clean -fd' will result in fill_directory() returning the following paths: nested/ nested/tracked nested/untracked and then correct_untracked_entries() would notice this can be compressed to nested/ and then since "nested/" is a directory, we would call remove_dirs("nested/", ...), which would check is_nonbare_repository_dir() and then decide to skip it. However, if someone also creates an ignored file: >nested/ignored then running 'git clean -fd' would result in fill_directory() returning the same paths: nested/ nested/tracked nested/untracked but correct_untracked_entries() will notice that we had ignored entries under nested/ and thus simplify this list to nested/tracked nested/untracked Since these are not directories, we do not call remove_dirs() which was the only place that had the is_nonbare_repository_dir() safety check -- resulting in us deleting both the untracked file and the tracked (and possibly dirty) file. One possible fix for this issue would be walking the parent directories of each path and checking if they represent nonbare repositories, but that would be wasteful. Even if we added caching of some sort, it's still a waste because we should have been able to check that "nested/" represented a nonbare repository before even descending into it in the first place. Add a DIR_SKIP_NESTED_GIT flag to dir_struct.flags and use it to prevent fill_directory() and friends from descending into nested git repos. With this change, we also modify two regression tests added in commit 91479b9c72f1 ("t7300: add tests to document behavior of clean and nested git", 2015-06-15). That commit, nor its series, nor the six previous iterations of that series on the mailing list discussed why those tests coded the expectation they did. In fact, it appears their purpose was simply to test _existing_ behavior to make sure that the performance changes didn't change the behavior. However, these two tests directly contradicted the manpage's claims that two -f's were required to delete files/directories under a nested git repository. While one could argue that the user gave an explicit path which matched files/directories that were within a nested repository, there's a slippery slope that becomes very difficult for users to understand once you go down that route (e.g. what if they specified "git clean -f -d '*.c'"?) It would also be hard to explain what the exact behavior was; avoid such problems by making it really simple. Also, clean up some grammar errors describing this functionality in the git-clean manpage. Finally, there are still a couple bugs with -ffd not cleaning out enough (e.g. missing the nested .git) and with -ffdX possibly cleaning out the wrong files (paying attention to outer .gitignore instead of inner). This patch does not address these cases at all (and does not change the behavior relative to those flags), it only fixes the handling when given a single -f. See https://public-inbox.org/git/20190905212043.GC32087@szeder.dev/ for more discussion of the -ffd[X?] bugs. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-17dir: add commentary explaining match_pathspec_item's return valueLibravatar Elijah Newren1-8/+19
The way match_pathspec_item() handles names and pathspecs with trailing slash characters, in conjunction with special options like DO_MATCH_DIRECTORY and DO_MATCH_LEADING_PATHSPEC were non-obvious, and broken until this patch series. Add a table in a comment explaining the intent of how these work. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-17dir: if our pathspec might match files under a dir, recurse into itLibravatar Elijah Newren1-4/+6
For git clean, if a directory is entirely untracked and the user did not specify -d (corresponding to DIR_SHOW_IGNORED_TOO), then we usually do not want to remove that directory and thus do not recurse into it. However, if the user manually specified specific (or even globbed) paths somewhere under that directory to remove, then we need to recurse into the directory to make sure we remove the relevant paths under that directory as the user requested. Note that this does not mean that the recursed-into directory will be added to dir->entries for later removal; as of a few commits earlier in this series, there is another more strict match check that is run after returning from a recursed-into directory before deciding to add it to the list of entries. Therefore, this will only result in files underneath the given directory which match one of the pathspecs being added to the entries list. Two notes of potential interest to future readers: * If we wanted to only recurse into a directory when it is specifically matched rather than matched-via-glob (e.g. '*.c'), then we could do so via making the final non-zero return in match_pathspec_item be MATCHED_RECURSIVELY instead of MATCHED_RECURSIVELY_LEADING_PATHSPEC. (Note that the relative order of MATCHED_RECURSIVELY_LEADING_PATHSPEC and MATCHED_RECURSIVELY are important for such a change.) I was leaving open that possibility while writing an RFC asking for the behavior we want, but even though we don't want it, that knowledge might help you understand the code flow better. * There is a growing amount of logic in read_directory_recursive() for deciding whether to recurse into a subdirectory. However, there is a comment immediately preceding this logic that says to recurse if instructed by treat_path(). It may be better for the logic in read_directory_recursive() to ultimately be moved to treat_path() (or another function it calls, such as treat_directory()), but I have left that for someone else to tackle in the future. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-17dir: make the DO_MATCH_SUBMODULE code reusable for a non-submodule caseLibravatar Elijah Newren1-3/+3
The specific checks done in match_pathspec_item for the DO_MATCH_SUBMODULE case are useful for other cases which have nothing to do with submodules. Rename this constant; a subsequent commit will make use of this change. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-17dir: also check directories for matching pathspecsLibravatar Elijah Newren1-0/+5
Even if a directory doesn't match a pathspec, it is possible, depending on the precise pathspecs, that some file underneath it might. So we special case and recurse into the directory for such situations. However, we previously always added any untracked directory that we recursed into to the list of untracked paths, regardless of whether the directory itself matched the pathspec. For the case of git-clean and a set of pathspecs of "dir/file" and "more", this caused a problem because we'd end up with dir entries for both of "dir" "dir/file" Then correct_untracked_entries() would try to helpfully prune duplicates for us by removing "dir/file" since it's under "dir", leaving us with "dir" Since the original pathspec only had "dir/file", the only entry left doesn't match and leaves nothing to be removed. (Note that if only one pathspec was specified, e.g. only "dir/file", then the common_prefix_len optimizations in fill_directory would cause us to bypass this problem, making it appear in simple tests that we could correctly remove manually specified pathspecs.) Fix this by actually checking whether the directory we are about to add to the list of dir entries actually matches the pathspec; only do this matching check after we have already returned from recursing into the directory. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-17dir: fix off-by-one error in match_pathspec_itemLibravatar Elijah Newren1-1/+2
For a pathspec like 'foo/bar' comparing against a path named "foo/", namelen will be 4, and match[namelen] will be 'b'. The correct location of the directory separator is namelen-1. However, other callers of match_pathspec_item() such as builtin/grep.c's submodule_path_match() will compare against a path named "foo" instead of "foo/". It might be better to change all the callers to be consistent, as discussed at https://public-inbox.org/git/xmqq7e6cdnkr.fsf@gitster-ct.c.googlers.com/ and https://public-inbox.org/git/CABPp-BERWUPCPq-9fVW1LNocqkrfsoF4BPj3gJd9+En43vEkTQ@mail.gmail.com/ but there are many cases to audit, so for now just make sure we handle both cases with and without a trailing slash. The reason the code worked despite this sometimes-off-by-one error was that the subsequent code immediately checked whether the first matchlen characters matched (which they do) and then bailed and return MATCHED_RECURSIVELY anyway since wildmatch doesn't have the ability to check if "name" can be matched as a directory (or prefix) against the pathspec. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-17dir: fix typo in commentLibravatar Elijah Newren1-1/+1
Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-05unpack-trees: rename 'is_excluded_from_list()'Libravatar Derrick Stolee1-8/+17
The first consumer of pattern-matching filenames was the .gitignore feature. In that context, storing a list of patterns as a 'struct exclude_list' makes sense. However, the sparse-checkout feature then adopted these structures and methods, but with the opposite meaning: these patterns match the files that should be included! Now that this library is renamed to use 'struct pattern_list' and 'struct pattern', we can now rename the method used by the sparse-checkout feature to determine which paths should appear in the working directory. The method is_excluded_from_list() is only used by the sparse-checkout logic in unpack-trees and list-objects-filter. The confusing part is that it returned 1 for "excluded" (i.e. it matches the list of exclusions) but that really manes that the path matched the list of patterns for _inclusion_ in the working directory. Rename the method to be path_matches_pattern_list() and have it return an explicit 'enum pattern_match_result'. Here, the values MATCHED = 1, UNMATCHED = 0, and UNDECIDED = -1 agree with the previous integer values. This shift allows future consumers to better understand what the retur values mean, and provides more type checking for handling those values. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-05treewide: rename 'exclude' methods to 'pattern'Libravatar Derrick Stolee1-39/+39
The first consumer of pattern-matching filenames was the .gitignore feature. In that context, storing a list of patterns as a 'struct exclude_list' makes sense. However, the sparse-checkout feature then adopted these structures and methods, but with the opposite meaning: these patterns match the files that should be included! It would be clearer to rename this entire library as a "pattern matching" library, and the callers apply exclusion/inclusion logic accordingly based on their needs. This commit renames several methods defined in dir.h to make more sense with the renamed 'struct exclude_list' to 'struct pattern_list' and 'struct exclude' to 'struct path_pattern': * last_exclude_matching() -> last_matching_pattern() * parse_exclude() -> parse_path_pattern() In addition, the word 'exclude' was replaced with 'pattern' in the methods below: * add_exclude_list() * add_excludes_from_file_to_list() * add_excludes_from_file() * add_excludes_from_blob_to_list() * add_exclude() * clear_exclude_list() A few methods with the word "exclude" remain. These will be handled seperately. In particular, the method "is_excluded()" is concretely about the .gitignore file relative to a specific directory. This is the important boundary between library and consumer: is_excluded() cares about .gitignore, but is_excluded() calls last_matching_pattern() to make that decision. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-05treewide: rename 'EXCL_FLAG_' to 'PATTERN_FLAG_'Libravatar Derrick Stolee1-11/+11
The first consumer of pattern-matching filenames was the .gitignore feature. In that context, storing a list of patterns as a 'struct exclude_list' makes sense. However, the sparse-checkout feature then adopted these structures and methods, but with the opposite meaning: these patterns match the files that should be included! It would be clearer to rename this entire library as a "pattern matching" library, and the callers apply exclusion/inclusion logic accordingly based on their needs. This commit replaces 'EXCL_FLAG_' to 'PATTERN_FLAG_' in the names of the flags used on 'struct path_pattern'. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-05treewide: rename 'struct exclude_list' to 'struct pattern_list'Libravatar Derrick Stolee1-53/+53
The first consumer of pattern-matching filenames was the .gitignore feature. In that context, storing a list of patterns as a 'struct exclude_list' makes sense. However, the sparse-checkout feature then adopted these structures and methods, but with the opposite meaning: these patterns match the files that should be included! It would be clearer to rename this entire library as a "pattern matching" library, and the callers apply exclusion/inclusion logic accordingly based on their needs. This commit renames 'struct exclude_list' to 'struct pattern_list' and renames several variables called 'el' to 'pl'. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-09-05treewide: rename 'struct exclude' to 'struct path_pattern'Libravatar Derrick Stolee1-56/+59
The first consumer of pattern-matching filenames was the .gitignore feature. In that context, storing a list of patterns as a list of 'struct exclude' items makes sense. However, the sparse-checkout feature then adopted these structures and methods, but with the opposite meaning: these patterns match the files that should be included! It would be clearer to rename this entire library as a "pattern matching" library, and the callers apply exclusion/inclusion logic accordingly based on their needs. This commit renames 'struct exclude' to 'struct path_pattern' and renames several variable names to match. 'struct pattern' was already taken by attr.c, and this more completely describes that the patterns are specific to file paths. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-06-13cleanup: fix possible overflow errors in binary search, part 2Libravatar René Scharfe1-1/+1
Calculating the sum of two array indexes to find the midpoint between them can overflow, i.e. code like this is unsafe for big arrays: mid = (first + last) >> 1; Make sure the intermediate value stays within the boundaries instead, like this: mid = first + ((last - first) >> 1); The loop condition of the binary search makes sure that 'last' is always greater than 'first', so this is safe as long as 'first' is not negative. And that can be verified easily using the pre-context of each change, except for name-hash.c, so add an assertion to that effect there. The unsafe calculations were found with: git grep '(.*+.*) *>> *1' This is a continuation of 19716b21a4 (cleanup: fix possible overflow errors in binary search, 2017-10-08). Signed-off-by: Rene Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-05-09Merge branch 'jk/untracked-cache-more-fixes'Libravatar Junio C Hamano1-23/+18
Code clean-up. * jk/untracked-cache-more-fixes: untracked-cache: simplify parsing by dropping "len" untracked-cache: simplify parsing by dropping "next" untracked-cache: be defensive about missing NULs in index
2019-05-09Merge branch 'nd/sha1-name-c-wo-the-repository'Libravatar Junio C Hamano1-0/+8
Further code clean-up to allow the lowest level of name-to-object mapping layer to work with a passed-in repository other than the default one. * nd/sha1-name-c-wo-the-repository: (34 commits) sha1-name.c: remove the_repo from get_oid_mb() sha1-name.c: remove the_repo from other get_oid_* sha1-name.c: remove the_repo from maybe_die_on_misspelt_object_name submodule-config.c: use repo_get_oid for reading .gitmodules sha1-name.c: add repo_get_oid() sha1-name.c: remove the_repo from get_oid_with_context_1() sha1-name.c: remove the_repo from resolve_relative_path() sha1-name.c: remove the_repo from diagnose_invalid_index_path() sha1-name.c: remove the_repo from handle_one_ref() sha1-name.c: remove the_repo from get_oid_1() sha1-name.c: remove the_repo from get_oid_basic() sha1-name.c: remove the_repo from get_describe_name() sha1-name.c: remove the_repo from get_oid_oneline() sha1-name.c: add repo_interpret_branch_name() sha1-name.c: remove the_repo from interpret_branch_mark() sha1-name.c: remove the_repo from interpret_nth_prior_checkout() sha1-name.c: remove the_repo from get_short_oid() sha1-name.c: add repo_for_each_abbrev() sha1-name.c: store and use repo in struct disambiguate_state sha1-name.c: add repo_find_unique_abbrev_r() ...
2019-05-09Merge branch 'km/empty-repo-is-still-a-repo'Libravatar Junio C Hamano1-2/+4
Running "git add" on a repository created inside the current repository is an explicit indication that the user wants to add it as a submodule, but when the HEAD of the inner repository is on an unborn branch, it cannot be added as a submodule. Worse, the files in its working tree can be added as if they are a part of the outer repository, which is not what the user wants. These problems are being addressed. * km/empty-repo-is-still-a-repo: add: error appropriately on repository with no commits dir: do not traverse repositories with no commits submodule: refuse to add repository with no commits
2019-04-25Merge branch 'js/untracked-cache-allocfix'Libravatar Junio C Hamano1-1/+1
An underallocation in the code to read the untracked cache extension has been corrected. * js/untracked-cache-allocfix: untracked cache: fix off-by-one
2019-04-25Merge branch 'bc/hash-transition-16'Libravatar Junio C Hamano1-14/+14
Conversion from unsigned char[20] to struct object_id continues. * bc/hash-transition-16: (35 commits) gitweb: make hash size independent Git.pm: make hash size independent read-cache: read data in a hash-independent way dir: make untracked cache extension hash size independent builtin/difftool: use parse_oid_hex refspec: make hash size independent archive: convert struct archiver_args to object_id builtin/get-tar-commit-id: make hash size independent get-tar-commit-id: parse comment record hash: add a function to lookup hash algorithm by length remote-curl: make hash size independent http: replace sha1_to_hex http: compute hash of downloaded objects using the_hash_algo http: replace hard-coded constant with the_hash_algo http-walker: replace sha1_to_hex http-push: remove remaining uses of sha1_to_hex http-backend: allow 64-character hex names http-push: convert to use the_hash_algo builtin/pull: make hash-size independent builtin/am: make hash size independent ...
2019-04-19untracked-cache: simplify parsing by dropping "len"Libravatar Jeff King1-8/+5
The code which parses untracked-cache extensions from disk keeps a "len" variable, which is the size of the string we are parsing. But since we now have an "end of string" variable, we can just use that to get the length when we need it. This eliminates the need to keep "len" up to date (and removes the possibility of any errors where "len" and "eos" get out of sync). As a bonus, it means we are not storing a string length in an "int", which is a potential source of overflows (though in this case it seems fairly unlikely for that to cause any memory problems). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-19untracked-cache: simplify parsing by dropping "next"Libravatar Jeff King1-13/+7
When we parse an on-disk untracked cache, we have two pointers, "data" and "next". As we parse, we point "next" to the end of an element, and then later update "data" to match. But we actually don't need two pointers. Each parsing step can just update "data" directly from other variables we hold (and we don't have to worry about bailing in an intermediate state, since any parsing failure causes us to immediately discard "data" and return). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-19untracked-cache: be defensive about missing NULs in indexLibravatar Jeff King1-7/+11
The on-disk format for the untracked-cache extension contains NUL-terminated filenames. We parse these from the mmap'd file using string functions like strlen(). This works fine in the normal case, but if we see a malformed or corrupted index, we might read off the end of our mmap. Instead, let's use memchr() to find the trailing NUL within the bytes we know are available, and return an error if it's missing. Note that we can further simplify by folding another range check into our conditional. After we find the end of the string, we set "next" to the byte after the string and treat it as an error if there are no such bytes left. That saves us from having to do a range check at the beginning of each subsequent string (and works because there is always data after each string). We can do both range checks together by checking "!eos" (we didn't find a NUL) and "eos == end" (it was on the last available byte, meaning there's nothing after). This replaces the existing "next > end" checks. Note also that the decode_varint() calls have a similar problem (we don't even pass them "end"; they just keep parsing). These are probably OK in practice since varints have a finite length (we stop parsing when we'd overflow a uintmax_t), so the worst case is that we'd overflow into reading the trailing bytes of the index. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-16sha1-name.c: remove the_repo from diagnose_invalid_index_path()Libravatar Nguyễn Thái Ngọc Duy1-0/+8
Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-12untracked cache: fix off-by-oneLibravatar Johannes Schindelin1-1/+1
In f9e6c649589e (untracked cache: load from UNTR index extension, 2015-03-08), code was added to read back the untracked cache from an index extension. Probably in the endeavor to avoid the `calloc()` implied by `FLEX_ALLOC_STR()` (it is hard to know why exactly, the commit message of that commit is a bit parsimonious with information), it calls `malloc()` manually and then `memcpy()`s the bits and pieces into place. It allocates the size of `struct untracked_cache_dir` plus the string length of the untracked file name, then copies the information in two steps: first the fixed-size metadata, then the name. And here lies the rub: it includes the trailing NUL byte in the name. If `FLEX_ARRAY` is defined as 0, this results in a buffer overrun. To fix this, let's just add 1, for the trailing NUL byte. Technically, this overallocates on platforms where `FLEX_ARRAY` is 1, but it should not matter much in reality, as `malloc()` usually overallocates anyway, unless the size to allocate aligns exactly with some internal chunk size (see below for more on that). The real strange thing is that neither valgrind nor DrMemory catches this bug. In this developer's tests, a `memcpy()` (but not a `memset()`!) could write up to 4 bytes after the allocated memory range before valgrind would start reporting an issue. However, when running Git built with nedmalloc as allocator, under rare conditions (and inconsistently at that), this bug triggered an `abort()` because nedmalloc rounds up the size to be `malloc()`ed to a multiple of a certain chunk size, then adds a few bytes to be used for storing some internal state. If there is no rounding up to do (because the size is already a multiple of that chunk size), and if the buffer is overrun as in the code patched in this commit, the internal state is corrupted. The scenario that triggered this here bug fix entailed a git.git checkout with an extra copy of the source code in an untracked subdirectory, meaning that there was an untracked subdirectory called "thunderbird-patch-inline" whose name's length is exactly 24 bytes, which, added to the size of above-mentioned `struct untracked_cache_dir` that weighs in with 104 bytes on a 64-bit system, amounts to 128, aligning perfectly with nedmalloc's chunk size. As there is no obvious way to trigger this bug reliably, on all platforms supported by Git, and as the bug is obvious enough, this patch comes without a regression test. Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-10dir: do not traverse repositories with no commitsLibravatar Kyle Meyer1-2/+4
When treat_directory() encounters a directory that is not in the index and DIR_NO_GITLINKS is unset, it calls resolve_gitlink_ref() to decide if a directory looks like a repository, in which case the directory won't be traversed. As a result, 'status -uall' and 'ls-files -o' will show only the directory, even when there are untracked files within the directory. For the unusual case where a repository doesn't have a commit checked out, resolve_gitlink_ref() returns -1 because HEAD cannot be resolved, and the directory is treated as a normal directory (i.e. traversal does not stop at the repository boundary). The status and ls-files commands above list untracked files within the repository rather than showing only the top-level directory. And if 'git add' is called on a repository with no commit checked out, any untracked files under the repository are added as blobs in the top-level project, a behavior that is unlikely to be what the caller intended. The above case is a corner case in an already unusual situation of the working tree containing a repository that is not a tracked submodule, but we might as well treat anything that looks like a repository consistently. Loosen the "looks like a repository" criteria in treat_directory() by replacing resolve_gitlink_ref() with is_nonbare_repository_dir(), one of the checks that is performed downstream when resolve_gitlink_ref() is called. As the required update to t3700-add shows, calling 'git add' on a repository with no commit checked out will now raise an error. While this is the desired behavior, note that the output isn't yet appropriate. The next commit will improve this output. Signed-off-by: Kyle Meyer <kyle@kyleam.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-04-01dir: make untracked cache extension hash size independentLibravatar brian m. carlson1-14/+14
Instead of using a struct with a flex array member to read and write the untracked cache extension, use a shorter, fixed-length struct and add the name and hash data explicitly. Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-03-20report_path_error(): drop unused prefix parameterLibravatar Jeff King1-2/+1
This hasn't been used since 17ddc66e70 (convert report_path_error to take struct pathspec, 2013-07-14), as the names in the struct will have already been prefixed when they were parsed. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
2019-02-06Merge branch 'nd/the-index-final'Libravatar Junio C Hamano1-1/+0
The assumption to work on the single "in-core index" instance has been reduced from the library-ish part of the codebase. * nd/the-index-final: cache.h: flip NO_THE_INDEX_COMPATIBILITY_MACROS switch read-cache.c: remove the_* from index_has_changes() merge-recursive.c: remove implicit dependency on the_repository merge-recursive.c: remove implicit dependency on the_index sha1-name.c: remove implicit dependency on the_index read-cache.c: replace update_index_if_able with repo_& read-cache.c: kill read_index() checkout: avoid the_index when possible repository.c: replace hold_locked_index() with repo_hold_locked_index() notes-utils.c: remove the_repository references grep: use grep_opt->repo instead of explict repo argument