-
Notifications
You must be signed in to change notification settings - Fork 58
/
TODO
338 lines (254 loc) · 11 KB
/
TODO
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
- Add support for logging to file (with different level?)
- Add support for libarchive filters in dwarfsextract.
- When hashing, start by only hashing the first, say, 4KiB,
and only if the hashes are identical, hash the whole file
- Add support for compressing uncompressed image formats,
currently primarily FITS, using different formats (e.g.
JPEG2K, Rice, Hcompress).
- Fragment counts don't necessarily match up in presence
of errors.
- When opening a file again, check that its timestamp,
size and potentially checksum did not change from
when we first saw it.
- Option to log to a file instead of stderr?
- Use thrift definitions for all options to make them
easily printable/storable?
- Use Elias-Fano for delta-encoded lists in metadata?
- Implement rewriting properly; keep order of blocks etc;
ability to remove history?; ability to re-pack metadata?;
ability to change other metadata properties (e.g. stuff
like --set-owner, --set-group, --chmod, --no-create-ts,
--set-time could all be done on existing metadata, but
obviously wouldn't be undo-able)
- Packaging of libs added via FetchContent
- Re-assemble global bloom filter rather than merging?
- Use smaller bloom filters for individual blocks?
- Use bigger (non-resettable?) global bloom filter?
- file discovery progress?
- show defaults for categorized options
- take a look at CPU measurements, those for nilsimsa
ordering are probably wrong
- segmenter tests with different granularities, block sizes,
any other options
- Bloom filters can be wasteful if lookback gets really long.
Maybe we can use smaller bloom filters for individual blocks
and one or two larger "global" bloom filters? It's going to
be impossible to rebuild those from the smaller filters,
though.
- Compress long repetitions of the same byte more efficiently.
Currently, segmentation finds an overlap after about one
window size. This goes on and on repeatedly. So we end up
with a *lot* of chunks pointing to the same segment. The
smaller the window size, the larger the number of chunks.
It's definitely a trade off, as storing large segments of
repeating bytes is wasteful when mounting the image.
Intriguing idea: pre-compute 256 (or just 2, for 0x00 and 0xFF)
hash values for window_size bytes to detect long sequences of
identical bytes.
OTHER intriguing idea: let a categorizer (could even be the
incompressible categorizer, but also "sparse file" categorizer
or something like that) detect these repetitions up front so
the segmenter doesn't have to do it (and it can be optional).
Then, we can customize the segmenter to run *extremely* fast
in this case.
- Wiki with use cases
- Perl releases
- Videos with shared streams
- Backups of audio files
- Compression of filesystem images for forensic purposes
- Mounting lots of images with shared cache?
- different scenarios for categorized files / chunks:
- Video files
- just store without compression, but perform segmentation, nothing special
- keep in lookback buffer for longer, as it doesn't cost memory
- needs parallelized segmenter (see below)
- PCM audio (see also github #95)
- segment first in case of e.g. different trims or other types of overlap
- split into chunks (for parallel decompression)
- compress each chunk as flac
- headers to be saved separately
- need to store original size and other information
This is actually quite easy:
- Identify PCM audio files (libmagic?)
- Nilsimsa similarity works surprisingly well
- We can potentially switch to larger window size for segmentation and use
larger lookback
- Run segmentation as usual
- Compress each block using FLAC (hopefully we can configure how much header data
and/or seek points etc. gets stored) or maybe even WAVPACK is we don't need perf
- I *think* this can be done even with the current metadata format without any
additions
- The features needed should be largely orthogonal to the features needed for the
scenarios below
- Executables, Shared Libs, ...
- run filter over blocks that contain a certain kind of binary data before
compression
- a single binary may contain machine code for different architectures,
so we may have to store different parts of the binary in different blocks
- actually quite similar to audio files above, except for the additional
filter used during compression/decompression
- JPG
- just store a recompressed version
- need to store original size
- no need for segmentation except for exact
- PDF
- decompress contents
- then segment
- then compress along with other PDFs (or documents in general)
- Other compressed format (gz, xz, ...)
- decompress
- segment
- compress
- essentially like PDF
- maybe only do this for small files? (option for size limit?)
- It should be possible to treat individual chunks differently, e.g.
WAV-header should be stored independently from contents; at some
point, we might look deeper into tar files and compress individual
contents differently.
- in the metadata, we need to know:
- the fact that a stored inode is "special" (can be reflected in a single bit)
- the type of "specialness"
- the original file size
- multi-threaded pre-matcher (for -Bn with n > 0)
- pre-compute matches/cyclic hashes for completed blocks; these don't
change and so we can do this with very little synchronization
- there are two possible strategies:
- split the input stream into chunks and then process each chunk in
a separate thread, checking all n blocks
- process the input stream in each thread and then only checking a
subset of past blocks (this seems more wasteful, but each thread
would only operate on a few instead of all bloom filters, which
could be better from a cache locality pov)
- similarity size limit to avoid similarity computation for huge files
- store files without similarity hash first, sorted descending by size
- use streaming interface for zstd decompressor
- json metadata recovery
- handle sparse files?
- try to be more resilient to modifications of the input while creating fs
- dwarfsck:
- show which entries a block references
- show partial metadata dumps at lower detail levels
- make dwarfsck more usable
- cleanup TODOs
- folly: dynamic should support string_view
- frozen: ViewBase.getPosition() should be const
- docs, moar tests
- extended attributes:
- number of blocks
- number of chunks
- number of times opened?
- per-file "hotness" (how often was a file opened); dump to file upon umount
- --unpack option
- readahead?
- window-increment-shift seems silly to configure?
- identify blocks that contain mostly binary data and adjust compressor?
- metadata stripping (i.e. re-write metadata without owner/time info)
- metadata repacking (e.g. just recompress/decompress the metadata block)
/*
scanner:
bhw= - 388.3s 13.07 GiB
bhw= 8 812.9s 7.559 GiB
bhw= 9 693.1s 7.565 GiB
bhw=10 651.8s 7.617 GiB
bhw=11 618.7s 7.313 GiB
bhw=12 603.6s 7.625 GiB
bhw=13 591.2s 7.858 GiB
bhw=14 574.1s 8.306 GiB
bhw=15 553.8s 8.869 GiB
bhw=16 541.9s 9.529 GiB
lz4:
<---- 1m29.535s / 9m31.212s
lz4hc:
1 - 20.94s - 2546 MiB
2 - 21.67s - 2441 MiB
3 - 24.19s - 2377 MiB
4 - 27.29s - 2337 MiB
5 - 31.49s - 2311 MiB
6 - 36.39s - 2294 MiB
7 - 42.04s - 2284 MiB
8 - 48.67s - 2277 MiB
9 - 56.94s - 2273 MiB <---- 1m27.979s / 9m20.637s
10 - 68.03s - 2271 MiB
11 - 79.54s - 2269 MiB
12 - 94.84s - 2268 MiB
zstd:
1 - 11.42s - 1667 MiB
2 - 12.95s - 1591 MiB <---- 2m8.351s / 15m25.752s
3 - 22.03s - 1454 MiB
4 - 25.64s - 1398 MiB
5 - 32.34s - 1383 MiB
6 - 41.45s - 1118 MiB <---- 2m4.258s / 14m28.627s
7 - 46.26s - 1104 MiB
8 - 53.34s - 1077 MiB
9 - 59.99s - 1066 MiB
10 - 63.3s - 1066 MiB
11 - 66.97s - 956 MiB <---- 2m3.496s / 14m17.862s
12 - 79.89s - 953 MiB
13 - 89.8s - 943 MiB
14 - 118.1s - 941 MiB
15 - 230s - 951 MiB
16 - 247.4s - 863 MiB <---- 2m11.202s / 14m57.245s
17 - 294.5s - 854 MiB
18 - 634s - 806 MiB
19 - 762.5s - 780 MiB
20 - 776.8s - 718 MiB <---- 2m16.448s / 15m43.923s
21 - 990.4s - 716 MiB
22 - 984.3s - 715 MiB <---- 2m18.133s / 15m55.263s
lzma:
level=6:dict_size=21 921.9s - 838.8 MiB <---- 5m11.219s / 37m36.002s
*/
Perl:
542 versions of perl
found/scanned: 152809/152809 dirs, 0/0 links, 1325098/1325098 files
original size: 32.03 GiB, saved: 19.01 GiB by deduplication (1133032 duplicate files), 5.835 GiB by segmenting
filesystem size: 7.183 GiB in 460 blocks (499389 chunks, 192066/192066 inodes), 460 blocks/662.3 MiB written
bench
build real user
-----------------------------------------------------------------------------------------------------
-rw-r--r-- 1 mhx users 14G Jul 27 23:11 perl-install-0.dwarfs 8:05 0:38 0:45
-rw-r--r-- 1 mhx users 4.8G Jul 27 23:18 perl-install-1.dwarfs 6:34 0:14 1:24
-rw-r--r-- 1 mhx users 3.8G Jul 27 23:26 perl-install-2.dwarfs 7:31 0:17 1:11
-rw-r--r-- 1 mhx users 3.2G Jul 27 23:36 perl-install-3.dwarfs 10:11 0:11 0:59
-rw-r--r-- 1 mhx users 1.8G Jul 27 23:47 perl-install-4.dwarfs 11:05 0:14 1:24
-rw-r--r-- 1 mhx users 1.2G Jul 27 23:59 perl-install-5.dwarfs 11:53 0:13 1:15
-rw-r--r-- 1 mhx users 901M Jul 28 00:16 perl-install-6.dwarfs 17:42 0:14 1:25
-rw-r--r-- 1 mhx users 704M Jul 28 00:37 perl-install-7.dwarfs 20:52 0:20 2:14
-rw-r--r-- 1 mhx users 663M Jul 28 04:04 perl-install-8.dwarfs 24:13 0:50 6:02
-rw-r--r-- 1 mhx users 615M Jul 28 02:50 perl-install-9.dwarfs 34:40 0:51 5:50
-rw-r--r-- 1 mhx users 3.6G Jul 28 09:13 perl-install-defaults.squashfs 17:20
-rw-r--r-- 1 mhx users 2.4G Jul 28 10:42 perl-install-opt.squashfs 71:49
soak:
-7 (cache=1g)
Passed with 542 of 542 combinations.
real 75m21.191s
user 68m3.903s
sys 6m21.020s
-9 (cache=1g)
Passed with 542 of 542 combinations.
real 118m48.371s
user 107m35.685s
sys 7m16.438s
squashfs-opt
real 81m36.957s
user 62m37.369s
sys 20m52.367s
-1 (cache=2g)
mhx@gimli ~ $ time find tmp/mount/ -type f | xargs -n 1 -P 32 -d $'\n' -I {} dd of=/dev/null if={} bs=64K status=none
real 2m19.927s
user 0m16.813s
sys 2m4.293s
-7 (cache=2g)
mhx@gimli ~ $ time find tmp/mount/ -type f | xargs -n 1 -P 32 -d $'\n' -I {} dd of=/dev/null if={} bs=64K status=none
real 2m24.346s
user 0m17.007s
sys 1m59.823s
squash-default
mhx@gimli ~ $ time find tmp/mount/ -type f | xargs -n 1 -P 32 -d $'\n' -I {} dd of=/dev/null if={} bs=64K status=none
real 8m41.594s
user 1m25.346s
sys 19m12.036s
squash-opt
mhx@gimli ~ $ time find tmp/mount/ -type f | xargs -n 1 -P 32 -d $'\n' -I {} dd of=/dev/null if={} bs=64K status=none
real 141m41.092s
user 1m12.650s
sys 59m18.194s