GLEP: 25
Title: Distfile Patching Support
Version: $Revision: 1.1 $
Last-Modified: $Date: 2004/04/04 22:56:06 $
Author: Brian Harring <ferringb@gentoo.org>
Status: deferred
Type: Standards Track
Content-Type: text/x-rst
Created: 6-Mar-2004
Post-History: 4-Apr-2004, 11-Nov-2004

Abstract
========

This GLEP proposes adding distfile patching support to portage, and works
through the implementation details.

Status
======

Timed out


Motivation
==========

Reduce the bandwidth load placed on our mirrors by decreasing the number of
bytes transferred when upgrading between versions. A side benefit is a
significant decrease in the download requirements for users lacking broadband.

Binary patches vs GNUDiff patches
=================================

Most people are familiar with diff patches (unified diff, for example). This
glep specifically proposes the use of an actual binary differencer. The
reason is that diff patches are line based: change a single character in a
line, and the whole line must be included in the patch. A binary differencer
works at the byte level and encodes just the changed byte. In that respect,
binary patches are often much more efficient than diff patches.
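
To make the line-versus-byte distinction concrete, the following Python sketch
changes one character in a long line and builds both kinds of patch. The
byte-level delta is a toy format invented for this illustration
(``byte_patch`` and ``apply_byte_patch`` are hypothetical names, and it only
handles equal-length inputs); it is not the format produced by diffball or any
other real differencer.

```python
import difflib


def line_patch(old: str, new: str) -> str:
    """Unified diff: every changed line appears in full, old and new."""
    diff = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile="old",
        tofile="new",
    )
    return "".join(diff)


def byte_patch(old: bytes, new: bytes) -> list[tuple[int, int]]:
    """Toy byte-level delta: (offset, new byte value) for each changed byte.

    Hypothetical format, equal-length inputs only. Real differencers also
    encode insertions, deletions and copies.
    """
    if len(old) != len(new):
        raise ValueError("toy delta only handles equal-length inputs")
    return [(i, b) for i, (a, b) in enumerate(zip(old, new)) if a != b]


def apply_byte_patch(old: bytes, patch: list[tuple[int, int]]) -> bytes:
    """Apply the toy delta to the old bytes and return the new bytes."""
    out = bytearray(old)
    for offset, value in patch:
        out[offset] = value
    return bytes(out)


# One long line with a single character changed ("41" becomes "42").
old_text = "int answer = 41; /* " + "x" * 60 + " */\nreturn answer;\n"
new_text = old_text.replace("41", "42")

text_diff = line_patch(old_text, new_text)
delta = byte_patch(old_text.encode(), new_text.encode())
```

Here the unified diff carries the whole changed line twice (once removed, once
added), while the toy delta is the single pair ``(14, 50)``: offset 14, new
byte ``2``. Note that the toy delta does not record the old byte, which is
also why such a patch cannot be applied in reverse.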

Further, a unified patch can be reversed because the diff includes **both**
the original line and the modified line. The author isn't aware of any binary
differencer that can create reversible patches; they are unidirectional, so a
generated patch can be used either to upgrade or to downgrade the version, not
both. The plus side of this limitation is a significantly decreased patch
size.

The choice of binary patches over diff patches pretty much comes down to the
fact that they're smaller. For example, a kdelibs binary patch for
3.1.4->3.1.5 is 75kb, while the equivalent diff patch is 123kb and is unable
to produce a file with the correct md5 [1]_.

Currently, this glep proposes only the use of binary patches. That's not to
say that (with a fair amount of work) it couldn't be extended to support
standard diffs.

Rationale
=========

The difference between source releases typically isn't very large, especially
for minor releases. As an example, kdelibs-3.1.4.tar.bz2 is 10.53 MB, and
kdelibs-3.1.5.tar.bz2 is 10.54 MB. A bzip2'ed patch between those versions is
75.6 kb [2]_, less than 1% of the size of 3.1.5's tbz2.

Specification
=============

Several parts of gentoo are affected: mirroring, the portage tree, and
portage itself.

Additions to the tree
---------------------

For adding patch info to the tree, this glep proposes a global patch list
(stored in profiles as patches.global), and individual patch lists stored in
the relevant package directories (named patches). Using the kernel packages as
an example, a global list of patches enables us to create a patch once, add an
entry, and have all kernel packages benefit from that single entry. Both
patches.global and the individual package patch files share the same format:

::

    MD5 md5-value patch-url size MD5 md5-value ref-file size UMD5 md5-value new-file size

Those familiar with the digest file layout should recognize this. Each group
is essentially checksum type, value, filename, size. The UMD5 checksum type is
just the md5/size of the uncompressed file: if the UMD5 were for a bzip2
compressed file, it would be the md5 value/size of the uncompressed file.
And an example:

::

    MD5 ccd5411b3558326cbce0306fcae32e26 http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.patch.bz2 75687 MD5 82c265de78d53c7060a09c5cb1a78942 kdelibs-3.1.4.tar.bz2 10537433 UMD5 0b1908a51e739c07ff5a88e189d2f7a9 kdelibs-3.1.5.tar.bz2 48056320

In the above example, the md5sum of
http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.patch.bz2 is
calculated and compared to the stored value, and then the file size is
checked. The one difference is the UMD5 checksum type: its md5 value and size
are specific to the *uncompressed* file. For cases where the patch resides on
one of our mirrors, the patch filename alone would be sufficient.
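
As an illustration of the entry layout described above, a patch list line
could be split into its three checksum groups roughly as follows. This is a
minimal sketch, not portage code: the names ``Checksum``, ``PatchEntry`` and
``parse_patch_entry`` are hypothetical, and it assumes fields are separated by
whitespace and that file names contain no spaces.

```python
from typing import NamedTuple


class Checksum(NamedTuple):
    kind: str   # "MD5" (file as stored) or "UMD5" (uncompressed file)
    md5: str    # 32 hex digits
    name: str   # patch URL or file name
    size: int   # size in bytes


class PatchEntry(NamedTuple):
    patch: Checksum  # the patch itself
    ref: Checksum    # the file the patch applies to
    new: Checksum    # the file the patch produces


def parse_patch_entry(line: str) -> PatchEntry:
    """Parse one patch list line into (patch, ref-file, new-file) groups."""
    fields = line.split()
    if len(fields) != 12:
        raise ValueError(f"expected 12 fields, got {len(fields)}")
    groups = []
    for i in range(0, 12, 4):
        kind, md5, name, size = fields[i:i + 4]
        if kind not in ("MD5", "UMD5"):
            raise ValueError(f"unknown checksum type: {kind}")
        if len(md5) != 32:
            raise ValueError(f"md5 value must be 32 hex digits: {md5}")
        int(md5, 16)  # raises ValueError if the value is not hexadecimal
        groups.append(Checksum(kind, md5, name, int(size)))
    return PatchEntry(*groups)


# The kdelibs example entry from this section.
EXAMPLE = (
    "MD5 ccd5411b3558326cbce0306fcae32e26 "
    "http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.patch.bz2 "
    "75687 "
    "MD5 82c265de78d53c7060a09c5cb1a78942 kdelibs-3.1.4.tar.bz2 10537433 "
    "UMD5 0b1908a51e739c07ff5a88e189d2f7a9 kdelibs-3.1.5.tar.bz2 48056320"
)
entry = parse_patch_entry(EXAMPLE)
```

With the example entry, ``entry.patch.size`` is 75687 and ``entry.new.kind``
is ``UMD5``, so the 48056320 size refers to the uncompressed 3.1.5 tarball.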

Finally, note that this is a unidirectional patch. Using the above example,
kdelibs-3.1.4-3.1.5 can **only** be used to upgrade from 3.1.4 to 3.1.5, not
in reverse (originally explained in `Binary patches vs GNUDiff patches`_).

Portage Implementation
----------------------

This glep proposes that patching support should be (at this stage) optional,
specifically enabled via FEATURES="patching".

Fetching
''''''''

When patching is enabled, the global patch list and the package's patch list
are read. From there, portage determines which files could be used as a base
for patching to the desired file, and whether patching is actually worth it
(it isn't when the target file is smaller than the sum of the patches needed).
Any patches to be used are fetched and md5 verified.
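
The decision described above can be sketched as a shortest-path search over
the patch lists. This glep does not prescribe an algorithm, so the following
is only one possible approach under stated assumptions: the function names are
hypothetical, patches are given as ``(patch_name, ref_file, new_file,
patch_size)`` tuples, and the compressed size of kdelibs-3.1.5.tar.bz2 is
approximated as 10,540,000 bytes from the 10.54 MB figure given earlier.

```python
import heapq


def cheapest_patch_chain(patches, local_files, target):
    """Return (total_patch_bytes, [patch names]) or None if no chain exists.

    Patches are unidirectional, so each one is an edge ref_file -> new_file.
    """
    edges = {}
    for name, ref, new, size in patches:
        edges.setdefault(ref, []).append((new, size, name))
    # Every file already in distfiles is a possible starting point at cost 0.
    queue = [(0, f, []) for f in sorted(local_files)]
    heapq.heapify(queue)
    seen = set()
    while queue:
        cost, current, chain = heapq.heappop(queue)
        if current == target:
            return cost, chain
        if current in seen:
            continue
        seen.add(current)
        for new, size, name in edges.get(current, []):
            if new not in seen:
                heapq.heappush(queue, (cost + size, new, chain + [name]))
    return None


def plan_fetch(patches, local_files, target, target_size):
    """Decide between using patches and fetching the full target file."""
    if target in local_files:
        return ("have", [])
    best = cheapest_patch_chain(patches, local_files, target)
    # Patching is not worth it when the patches are as large as the target.
    if best is None or best[0] >= target_size:
        return ("full", [target])
    return ("patch", best[1])


patches = [
    ("kdelibs-3.1.4-3.1.5.patch.bz2", "kdelibs-3.1.4.tar.bz2",
     "kdelibs-3.1.5.tar.bz2", 75687),
]
plan = plan_fetch(patches, {"kdelibs-3.1.4.tar.bz2"},
                  "kdelibs-3.1.5.tar.bz2", 10_540_000)
```

For the kdelibs example, ``plan`` selects the 75687-byte patch rather than
the full tarball. With no usable base file in distfiles, the same call falls
back to fetching the full source.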

Reconstruction
''''''''''''''

Once the patch(es) have been fetched and md5 verified, the desired file is
reconstructed. Assuming reconstruction didn't return any errors, the target
file has its uncompressed md5sum calculated and verified, then is recompressed
and the compressed md5sum calculated. At this point, if the compressed md5
matches the md5 stored in the tree, then portage transfers the file into
distfiles, and continues on its merry way.

If the compressed md5 is different from the tree's value, then the (proposed)
md5 database is updated with the new compressed md5. Details of this database
(and the issue it addresses) follow.
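
The verification order described above might look like the following sketch.
It is not the reference implementation (none exists yet): the function name,
its parameters and the three status strings are hypothetical, and it assumes a
bzip2-compressed target recompressed with Python's ``bz2`` module.

```python
import bz2
import hashlib


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def verify_reconstructed(raw, umd5, usize, tree_md5, tree_size, level=9):
    """Check a reconstructed (uncompressed) file, then recompress it.

    Returns (status, compressed_bytes, alt_pair), where status is one of:
      "rejected"  - uncompressed md5/size mismatch; the file must not be used
      "exact"     - recompressed md5/size match the values in the tree
      "alternate" - content verified, but the compressed md5/size differ;
                    alt_pair is the (md5, size) to record in the md5 db
    """
    # Step 1: the uncompressed content must match the UMD5 entry.
    if len(raw) != usize or md5_hex(raw) != umd5:
        return "rejected", None, None
    # Step 2: recompress and compare against the tree's compressed md5.
    packed = bz2.compress(raw, level)
    pair = (md5_hex(packed), len(packed))
    if pair == (tree_md5, tree_size):
        return "exact", packed, None
    # Step 3: verified content, different compressed form.
    return "alternate", packed, pair


# Stand-in data; a real caller would pass the reconstructed tarball.
raw = b"example source tree contents\n" * 1000
original = bz2.compress(raw, 9)
status, packed, alt = verify_reconstructed(
    raw, md5_hex(raw), len(raw), md5_hex(original), len(original))
```

The important property is the ordering: an alternate md5/size pair is only
ever produced after the uncompressed content has been verified.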

Compressed MD5sums:
'''''''''''''''''''

There will be instances where a file is reconstructed perfectly, recompressed,
and the recompressed md5sum differs from what is stored in the tree. The
problem is that the md5sum of a compressed file is inherently tied to the
compressor version/options used to compress the original source.

=====================
The Problem in Detail
=====================

A good example of this problem is related to the bzip2 versions used for
compression. Between bzip2 0.9x and bzip2 1.x, there was a subtle change in
the compressor resulting in slightly better compression; the end result is a
different file, i.e. a different md5sum. Assuming the compressor versions are
the same, there is also the issue of what compression level the target source
was originally compressed at: was it compressed with -9, -8 or -7? That's just
a sampling of the original settings that must be accounted for, and it is
limited to gzip/bzip2; other compressors add to the number of variables that
must be accounted for to produce an exact recreation of the compressed md5sum.

Tracking the compressor version and options originally used isn't really a
valid option. Even if all options were accounted for, clients would still be
required to have multiple versions of the same compressor installed just for
the sake of recreating a compressed md5sum, *even though* the uncompressed
source's md5 has already been verified.
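
The compression-level half of the problem is easy to reproduce. The sketch
below uses Python's ``bz2`` module on stand-in data (not a real distfile) and
compresses the same bytes at levels 9 and 1. It does not demonstrate the
difference between bzip2 versions.

```python
import bz2
import hashlib

# Stand-in for an uncompressed source tarball.
source = b"int main(void) { return 0; }\n" * 5000

packed_9 = bz2.compress(source, 9)  # equivalent to bzip2 -9
packed_1 = bz2.compress(source, 1)  # equivalent to bzip2 -1

md5_9 = hashlib.md5(packed_9).hexdigest()
md5_1 = hashlib.md5(packed_1).hexdigest()

# Both archives decompress to exactly the same bytes.
same_content = bz2.decompress(packed_9) == bz2.decompress(packed_1) == source
```

The two compressed md5sums differ while ``same_content`` is true, which is
exactly the case the uncompressed (UMD5) checksum is meant to handle.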

=====================
The Proposed Solution
=====================

The creation of a client-side flatfile/db of valid alternate md5/size pairs
would enable portage to handle perfectly reconstructed files that have a
different md5sum due to compression differences. The proposed format is thus:

::

    MD5 md5sum orig-file size MD5 md5sum [ optional new-name ] size

Example:

::

    MD5 984146931906a7d53300b29f58f6a899 OOo_1.0.3_source.tar.bz2 165475319 MD5 0733dd85ed44d88d1eabed704d579721 165444187

An alternate md5/size pair for a file would be added **only** when the
uncompressed source's md5/size has been verified, yet upon recompression the
md5 differs. Cleansing older md5/size pairs from this db would require a
utility; the author suggests adding a distfiles-cleaning utility to portage
that can also cleanse old md5/size pairs when the file the pair was created
for no longer exists in distfiles.
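
A sketch of how such a db could be read and cleansed follows. The names
``AltEntry``, ``parse_alt_entry`` and ``prune_stale`` are hypothetical, and
the sketch assumes that an entry with the optional new-name refers to a file
stored under that name in distfiles, which this glep does not spell out.

```python
import os
from typing import NamedTuple, Optional


class AltEntry(NamedTuple):
    orig_md5: str
    orig_file: str
    orig_size: int
    alt_md5: str
    new_name: Optional[str]  # only present when the optional field is used
    alt_size: int


def parse_alt_entry(line: str) -> AltEntry:
    """Parse one alternate md5 db line (7 fields, or 8 with a new-name)."""
    f = line.split()
    if len(f) not in (7, 8) or f[0] != "MD5" or f[4] != "MD5":
        raise ValueError(f"malformed alternate md5 entry: {line!r}")
    new_name = f[6] if len(f) == 8 else None
    return AltEntry(f[1], f[2], int(f[3]), f[5], new_name, int(f[-1]))


def prune_stale(entries, distdir):
    """Keep only the entries whose file still exists in distdir."""
    kept = []
    for entry in entries:
        # Assumption: a renamed file lives under new_name in distfiles.
        name = entry.new_name or entry.orig_file
        if os.path.exists(os.path.join(distdir, name)):
            kept.append(entry)
    return kept


# The OOo example entry from this section.
EXAMPLE = (
    "MD5 984146931906a7d53300b29f58f6a899 OOo_1.0.3_source.tar.bz2 165475319 "
    "MD5 0733dd85ed44d88d1eabed704d579721 165444187"
)
entry = parse_alt_entry(EXAMPLE)
```

A cleaning utility would call ``prune_stale`` with the configured distfiles
directory and write the surviving entries back to the db.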

Where to store the database is debatable; /etc/portage or /var/cache/edb are
definite options.

The reason for allowing an optional new-name is that it provides needed
functionality should anyone attempt to extend portage to allow clients to
change the compression used for a source (e.g., recompress all gzip files as
bzip2). Granted, no such code or attempt has been made, but nothing is lost
by leaving the option open should the request/attempt be made.

A potential gotcha of adding this support is that in environments where the
distfiles directory is shared out to multiple systems, this db must be shared
as well.



Distfile Mirror Additions
-------------------------

One issue of contention is where these files will actually be stored. As of
the writing of this glep, a full distfiles mirror is roughly 40 gb; a rough
estimate by the author places the space requirements for patches for each
version at a total of around 4gb. Note that this isn't even remotely a hard
figure yet, and a better figure is currently being worked out.

Regardless of the exact space figure, finding a place to store the patches
will be problematic. Expansion of the required mirror space (essentially just
swallowing the patches' storage requirement) is unlikely, since it was one of
the main arguments against the now defunct glep9 attempt [3]_. A couple of
ideas that have been put forth to handle the additional space requirements are
as follows:

1) Identification of mirrors willing to handle the extra space requirements,
   essentially creating an additional patch mirror tier.

2) Mirroring only a patch for certain package versions, rather than full
   source. Using kdelibs-3.1.5 as an example, only the patch would be mirrored
   (rather than the full 10.53 MB source). The downside to this approach is
   that a user who is downloading kdelibs for the first time would either need
   to pull it from the original SRC_URI (placing the burden onto the upstream
   mirror), or pull the 3.1.4 version and the patch, pulling 63k more than if
   they had just pulled the full version. The kdelibs 3.1.4/3.1.5 example is
   something of an optimal case; not all versions will have such minuscule
   patches.

3) A variation on the idea above, essentially mirroring only the patch for
   the oldest version(s) of a package; e.g., kdelibs currently has versions
   3.05, 3.1.5, 3.2.0, and 3.2.1, and the mirrors would only carry a patch for
   3.05, not full source (think RESTRICT="fetch"). One plus is that patches to
   downgrade in version are smaller than the patches to upgrade in version;
   there are exceptions to this, but they're hard to find. The major downsides
   to this approach are that A) a user would have to sync up to get the
   patchlists for that version, and B) it requires the creation of a set of
   patches to go backwards in version (see
   `Binary patches vs GNUDiff patches`_).

Of the options listed above, the first is the easiest, although the second
could be made to work. Feedback and any possible alternatives would be
greatly appreciated.

Patch Creation
--------------

Maintenance of patch lists and the actual patch creation ought to be managed
by a high level script. Essentially a dev says "I want a patch between this
version, and that version: make it so", and the script churns away
creating/updating the patch list and generating the patch locally. The
utility next uploads the new patch to /space/distfiles-local on dev.gentoo.org
(unless it's not a locally generated patch), and repoman is used to commit the
updated patch list.

What would be preferable (although possibly wishful thinking) is if hardware
could be co-opted for automatic patch generation, rather than forcing it upon
the devs; something akin to how files are pulled onto the mirror automatically
for new ebuilds.

The initial bulk of patches will be generated by the author, to ease the
transition and offer patches for people to test out.

Backwards Compatibility
=======================

As noted in `The Proposed Solution`_, a system using patching and sharing out
its distfiles must share out its alternate md5 db. Any system that uses the
distfiles share must also support the alternate md5 db. If this is considered
enough of an issue, it is conceivable to place reconstructed sources with an
alternate md5 into a subdirectory of distdir; portage only looks within
distdir and does not descend into subdirectories.

Also note that `Distfile Mirror Additions`_ may add further backwards
compatibility issues, depending on what solution is accepted.

Reference Implementation
========================

TODO

References
==========

.. [1] http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.{patch,diff}.bz2.
.. [2] kdelibs-3.1.4-3.1.5.patch.bz2, switching format patch, created via diffball-0.4_pre4 (diffball is available at http://sourceforge.net/projects/diffball)
   Bzip2 -9 compressed, the patch is 75,687 bytes, uncompressed it is 337,649 bytes. The patch is available at http://dev.gentoo.org/~ferringb/kdelibs-3.1.4-3.1.5.patch.bz2 for those curious.
.. [3] Glep9, 'Gentoo Package Update System'
   (http://glep.gentoo.org/glep-0009.html)

Copyright
=========

This document has been placed in the public domain.