convert: stream from fd to required clean filter to reduce used address space

The data is streamed to the filter process anyway. Better avoid mapping the file if possible. This is especially useful if a clean filter reduces the size, for example if it computes a sha1 for binary data, like git media. The file size that the previous implementation could handle was limited by the available address space; large files for example could not be handled with (32-bit) msysgit. The new implementation can filter files of any size as long as the filter output is small enough.

The new code path is only taken if the filter is required. The filter consumes data directly from the fd. If it fails, the original data is not immediately available. The condition can easily be handled as a fatal error, which is expected for a required filter anyway.

If the filter was not required, the condition would need to be handled in a different way, like seeking to 0 and reading the data. But this would require more restructuring of the code and is probably not worth it. The obvious approach of falling back to reading all data would not help achieve the main purpose of this patch, which is to handle large files with limited address space. If reading all data is an option, we can simply take the old code path right away and mmap the entire file.

The environment variable GIT_MMAP_LIMIT, which has been introduced in a previous commit, is used to test that the expected code path is taken. A related test that exercises required filters is modified to verify that the data actually has been modified on its way from the file system to the object store.

Signed-off-by: Steffen Prohaska <prohaska@zib.de>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
author: Steffen Prohaska <prohaska@zib.de> 2014-08-26 17:23:25 +0200
committer: Junio C Hamano <gitster@pobox.com> 2014-08-28 10:25:15 -0700
commit: 9035d75a2be9d80d82676504d69553245017f6d4 (patch)
tree: a275b416bd102af42760a60edc77594744dbeb7e /sha1_file.c
parent: copy_fd(): do not close the input file descriptor (diff)
download: tgif-9035d75a2be9d80d82676504d69553245017f6d4.tar.xz
1 files changed, 26 insertions, 1 deletions
diff --git a/sha1_file.c b/sha1_file.c
index d9b51578cc..423ec64e87 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -3090,6 +3090,29 @@ static int index_mem(unsigned char *sha1, void *buf, size_t size,
 	return ret;
 }
 
+static int index_stream_convert_blob(unsigned char *sha1, int fd,
+				     const char *path, unsigned flags)
+{
+	int ret;
+	const int write_object = flags & HASH_WRITE_OBJECT;
+	struct strbuf sbuf = STRBUF_INIT;
+
+	assert(path);
+	assert(would_convert_to_git_filter_fd(path));
+
+	convert_to_git_filter_fd(path, fd, &sbuf,
+				 write_object ? safe_crlf : SAFE_CRLF_FALSE);
+
+	if (write_object)
+		ret = write_sha1_file(sbuf.buf, sbuf.len, typename(OBJ_BLOB),
+				      sha1);
+	else
+		ret = hash_sha1_file(sbuf.buf, sbuf.len, typename(OBJ_BLOB),
+				     sha1);
+	strbuf_release(&sbuf);
+	return ret;
+}
+
 static int index_pipe(unsigned char *sha1, int fd, enum object_type type,
 		      const char *path, unsigned flags)
 {
@@ -3157,7 +3180,9 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st,
 	int ret;
 	size_t size = xsize_t(st->st_size);
 
-	if (!S_ISREG(st->st_mode))
+	if (type == OBJ_BLOB && path && would_convert_to_git_filter_fd(path))
+		ret = index_stream_convert_blob(sha1, fd, path, flags);
+	else if (!S_ISREG(st->st_mode))
 		ret = index_pipe(sha1, fd, type, path, flags);
 	else if (size <= big_file_threshold || type != OBJ_BLOB ||
 		 (path && would_convert_to_git(path)))
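
The idea described in the commit message, letting the filter read the file through its descriptor so that only the filter output is held in memory, can be illustrated outside of git with a small POSIX sketch. This is not git's implementation: `filter_fd()` is a hypothetical helper invented for this example, it runs the filter through `/bin/sh`, and it hands the descriptor to the child directly as stdin rather than copying the data as git's convert code does. Like the required-filter path, it treats any filter failure as an error and makes no attempt to re-read the original data.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/*
 * Hypothetical helper (not part of git): run `cmd` with in_fd as its
 * stdin and collect its stdout in a malloc'ed buffer. The input file
 * is never mapped or read by the parent. Returns 0 on success, -1 if
 * the filter could not be run, exited non-zero, or memory ran out.
 */
static int filter_fd(int in_fd, const char *cmd, char **out, size_t *out_len)
{
	int pipefd[2], status = 0, failed = 0;
	pid_t pid;
	char *buf = NULL;
	size_t len = 0, alloc = 0;

	assert(cmd && out && out_len);

	if (pipe(pipefd) < 0)
		return -1;
	pid = fork();
	if (pid < 0) {
		close(pipefd[0]);
		close(pipefd[1]);
		return -1;
	}
	if (pid == 0) {
		/* Child: the filter consumes the file itself through in_fd. */
		if (dup2(in_fd, STDIN_FILENO) < 0 ||
		    dup2(pipefd[1], STDOUT_FILENO) < 0)
			_exit(126);
		close(pipefd[0]);
		close(pipefd[1]);
		execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
		_exit(127);
	}

	/* Parent: only the filter output is ever held in memory. */
	close(pipefd[1]);
	for (;;) {
		ssize_t n;

		if (alloc - len < 4096) {
			size_t new_alloc = alloc ? alloc * 2 : 8192;
			char *tmp = realloc(buf, new_alloc);
			if (!tmp) {
				failed = 1;
				break;
			}
			buf = tmp;
			alloc = new_alloc;
		}
		n = read(pipefd[0], buf + len, alloc - len);
		if (n < 0 && errno == EINTR)
			continue;
		if (n < 0)
			failed = 1;
		if (n <= 0)
			break;
		len += (size_t)n;
	}
	close(pipefd[0]);

	while (waitpid(pid, &status, 0) < 0) {
		if (errno != EINTR) {
			failed = 1;
			break;
		}
	}
	/* A failed filter is fatal here; the input has already been consumed. */
	if (failed || !WIFEXITED(status) || WEXITSTATUS(status) != 0) {
		free(buf);
		return -1;
	}
	*out = buf;
	*out_len = len;
	return 0;
}
```

A caller would open the file, pass the descriptor and a filter command to `filter_fd()`, and then hash or store the returned buffer. The memory needed grows with the filter output, not with the size of the input file, which is the property the patch is after.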