Revision 1.3
Fri Apr 1 01:32:19 2005 UTC by g2boojum
Branch: MAIN
CVS Tags: HEAD
Changes since 1.2: +4 -4 lines
File MIME type: text/plain
spelling fix

GLEP: 25
Title: Distfile Patching Support
Version: $Revision: 1.2 $
Last-Modified: $Date: 2004/11/11 21:34:36 $
Author: Brian Harring <ferringb@gentoo.org>
Status: deferred
Type: Standards Track
Content-Type: text/x-rst
Created: 6-Mar-2004
Post-History: 4-Apr-2004, 11-Nov-2004

Abstract
========

The intention of this GLEP is to propose the creation of patching support for
portage, and to iron out the implementation details.

Status
======

Timed out


Motivation
==========

Reduce the bandwidth load placed on our mirrors by decreasing the number of
bytes transferred when upgrading between versions. A side benefit is a
significant decrease in the download requirements for users lacking broadband.

Binary patches vs GNUDiff patches
=================================

Most people are familiar with diff patches (unified diff, for example); this
GLEP specifically proposes the use of an actual binary differencer. The
reason is that diff patches are line based: change a single character in a
line, and the whole line must be included in the patch. Binary differencers
work at the byte level and encode just the changed byte. In that respect
binary patches are often much more efficient than diff patches.

Further, the ability to reverse a unified patch is due to the fact that the
diff includes **both** the original line and the modified line. The author
isn't aware of any binary differencer that is able to create patches that can
be reversed. They are unidirectional: a generated patch can be used either to
upgrade or to downgrade the version, not both. The plus side of this
limitation is a significantly decreased patch size.

The choice of binary patches over diff patches pretty much comes down to the
fact that they're smaller. For example, a kdelibs binary patch for
3.1.4->3.1.5 is 75 kB, while the equivalent diff patch is 123 kB and is unable
to result in a correct md5 [1]_.

Currently, this GLEP proposes only the usage of binary patches. That's not to
say it couldn't be extended (with a fair amount of work) to support standard
diffs.
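
As a rough illustration of the line-based versus byte-level difference, the
following Python sketch changes one character in a hypothetical 200-character
line. The unified diff (built with the standard ``difflib`` module) has to
carry both full lines, while a toy byte-level delta needs a single
(offset, byte) pair. The toy delta is an assumption made purely for
illustration; it is not the format of any real binary differencer.

```python
import difflib

# Hypothetical 200-character source line, and the same line with one
# character changed at offset 100.
old_line = "x" * 200 + "\n"
new_line = "x" * 100 + "y" + "x" * 99 + "\n"

# Line-based patch: both the old and the new line appear in full.
unified = "".join(
    difflib.unified_diff([old_line], [new_line], "a/file", "b/file")
)

# Toy byte-level delta (illustrative only): one (offset, new byte) pair
# per differing byte.
old_bytes, new_bytes = old_line.encode(), new_line.encode()
byte_delta = [
    (offset, new)
    for offset, (old, new) in enumerate(zip(old_bytes, new_bytes))
    if old != new
]

print(len(unified), "characters in the unified diff")
print(byte_delta, "is the whole byte-level delta")
```

The one-entry delta also shows why such patches are unidirectional: it records
the new byte but not the byte it replaced.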

Rationale
=========

The difference between source releases typically isn't very large, especially
for minor releases. As an example, kdelibs-3.1.4.tar.bz2 is 10.53 MB, and
kdelibs-3.1.5.tar.bz2 is 10.54 MB. A bzip2'ed patch between those versions is
75.6 kB [2]_, less than 1% of the size of 3.1.5's tar.bz2.

Specification
=============

Several parts of Gentoo are affected: mirroring, the portage tree, and
portage itself.

Additions to the tree
---------------------

For adding patch info into the tree, this GLEP proposes a global patch list
(stored in profiles as patches.global), and individual patch lists stored in
the relevant package directories (named patches). Using the kernel packages as
an example, a global list of patches enables us to create a patch once, add an
entry, and have all kernel packages benefit from that single entry. Both
patches.global and the individual package patch files share the same format:

::

    MD5 md5-value patch-url size MD5 md5-value ref-file size UMD5 md5-value new-file size

This should look familiar to anyone who knows the digest file layout:
essentially checksum type, value, filename, size. The UMD5 checksum type is
just the md5/size of the uncompressed file, so if the UMD5 were for a bzip2
compressed file, it would be the md5 value/size of the uncompressed file.
An example:

::

    MD5 ccd5411b3558326cbce0306fcae32e26 http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.patch.bz2 75687 MD5 82c265de78d53c7060a09c5cb1a78942 kdelibs-3.1.4.tar.bz2 10537433 UMD5 0b1908a51e739c07ff5a88e189d2f7a9 kdelibs-3.1.5.tar.bz2 48056320

In the above example, the md5sum of
http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.patch.bz2 is
calculated and compared to the stored value, and then the file size is checked.
The one difference is the UMD5 checksum type: its md5 value and size are
specific to the *uncompressed* file. For cases where the patch will reside on
one of our mirrors, the patch filename alone would be sufficient in place of
the URL.

Finally, note that this is a unidirectional patch. Using the above example,
kdelibs-3.1.4-3.1.5 can **only** be used to upgrade from 3.1.4 to 3.1.5, not
in reverse (originally explained in `Binary patches vs GNUDiff patches`_).
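
The entry layout above could be read with something like the following
Python sketch. The names ``Checksum``, ``PatchEntry`` and
``parse_patch_entry`` are hypothetical and are not part of portage; the sketch
only assumes that an entry is twelve whitespace-separated fields in the order
shown.

```python
from typing import NamedTuple


class Checksum(NamedTuple):
    kind: str    # checksum type: "MD5" or "UMD5"
    digest: str  # hex md5 value
    name: str    # filename, or a URL for a patch that is not on the mirrors
    size: int    # size in bytes (of the uncompressed file for UMD5)


class PatchEntry(NamedTuple):
    patch: Checksum  # the patch itself
    ref: Checksum    # the file the patch is applied to
    new: Checksum    # the file the patch produces


def parse_patch_entry(line: str) -> PatchEntry:
    """Parse one patch list line (hypothetical helper, not portage API)."""
    fields = line.split()
    if len(fields) != 12:
        raise ValueError(f"expected 12 fields, got {len(fields)}")
    groups = [
        Checksum(fields[i], fields[i + 1], fields[i + 2], int(fields[i + 3]))
        for i in (0, 4, 8)
    ]
    if [g.kind for g in groups] != ["MD5", "MD5", "UMD5"]:
        raise ValueError("unexpected checksum types")
    return PatchEntry(*groups)


example = (
    "MD5 ccd5411b3558326cbce0306fcae32e26 "
    "http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.patch.bz2 75687 "
    "MD5 82c265de78d53c7060a09c5cb1a78942 kdelibs-3.1.4.tar.bz2 10537433 "
    "UMD5 0b1908a51e739c07ff5a88e189d2f7a9 kdelibs-3.1.5.tar.bz2 48056320"
)
entry = parse_patch_entry(example)
print(entry.ref.name, "->", entry.new.name, "via", entry.patch.size, "byte patch")
```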

Portage Implementation
----------------------

This GLEP proposes that patching support should be (at this stage) optional,
specifically enabled via FEATURES="patching".

Fetching
''''''''

When patching is enabled, the global patch list and the package's patch list
are read. From there, portage determines which files could be used as a base
for patching to the desired file, and further determines whether patching is
actually worthwhile (it isn't when the target file is smaller than the sum of
the patches needed). Any patches to be used are fetched and md5 verified.
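
A minimal sketch of that decision in Python follows. The tuple layout
``(patch_name, patch_size, ref_file, new_file)`` and the function names are
assumptions made for illustration, and the breadth-first search simply finds
the chain with the fewest patches, which is not necessarily the one with the
fewest bytes.

```python
from collections import deque


def find_patch_chain(patches, available, target):
    """Return a list of patches leading from any available file to target.

    patches:   iterable of (patch_name, patch_size, ref_file, new_file)
    available: files already present in distfiles
    Returns [] if target is already available, None if no chain exists.
    """
    queue = deque((name, []) for name in available)
    seen = set(available)
    while queue:
        current, chain = queue.popleft()
        if current == target:
            return chain
        for patch in patches:
            _, _, ref_file, new_file = patch
            if ref_file == current and new_file not in seen:
                seen.add(new_file)
                queue.append((new_file, chain + [patch]))
    return None


def worth_patching(chain, target_size):
    """Patching only pays off if the patches total less than the target."""
    return bool(chain) and sum(size for _, size, _, _ in chain) < target_size


patches = [
    ("foo-1.0-1.1.patch.bz2", 100, "foo-1.0.tar.bz2", "foo-1.1.tar.bz2"),
    ("foo-1.1-1.2.patch.bz2", 150, "foo-1.1.tar.bz2", "foo-1.2.tar.bz2"),
]
chain = find_patch_chain(patches, ["foo-1.0.tar.bz2"], "foo-1.2.tar.bz2")
print([name for name, _, _, _ in chain], worth_patching(chain, 1000))
```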

Reconstruction
''''''''''''''

Upon fetching and md5 verification of the patch(es), the desired file is
reconstructed. Assuming reconstruction didn't return any errors, the target
file has its uncompressed md5sum calculated and verified, then is recompressed
and the compressed md5sum calculated. At this point, if the compressed md5
matches the md5 stored in the tree, portage transfers the file into
distfiles and continues on its merry way.

If the compressed md5 is different from the tree's value, then the (proposed)
md5 database is updated with the new compressed md5. Details of this database
(and the issue it addresses) follow.
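
The verification order described above might look roughly like this in
Python. The function ``check_reconstruction`` and its return convention are
hypothetical, and bzip2 at level 9 is assumed as the recompressor purely for
the sake of the example.

```python
import bz2
import hashlib


def md5_and_size(data: bytes):
    return hashlib.md5(data).hexdigest(), len(data)


def check_reconstruction(raw, expected_umd5, expected_usize, tree_md5):
    """Verify a reconstructed (uncompressed) file, then recompress it.

    Returns (compressed_bytes, alternate), where alternate is None when the
    recompressed md5 matches the tree, or the (md5, size) pair that would be
    recorded in the alternate md5 database otherwise.
    """
    # Step 1: the uncompressed md5/size (UMD5) must match, or the
    # reconstruction is rejected outright.
    if md5_and_size(raw) != (expected_umd5, expected_usize):
        raise ValueError("reconstructed file failed uncompressed verification")
    # Step 2: recompress and compare against the md5 stored in the tree.
    compressed = bz2.compress(raw, 9)
    md5, size = md5_and_size(compressed)
    if md5 == tree_md5:
        return compressed, None
    return compressed, (md5, size)


raw = b"pretend this is an unpacked tarball\n" * 1000
umd5, usize = md5_and_size(raw)
tree_md5 = hashlib.md5(bz2.compress(raw, 9)).hexdigest()

_, alternate = check_reconstruction(raw, umd5, usize, tree_md5)
print("alternate pair needed:", alternate is not None)
```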

Compressed MD5sums:
'''''''''''''''''''

There will be instances where a file is reconstructed perfectly, recompressed,
and the recompressed md5sum differs from what is stored in the tree. The
problem is that the md5sum of a compressed file is inherently tied to the
compressor version/options used to compress the original source.

=====================
The Problem in Detail
=====================

A good example of this problem is related to the bzip2 versions used for
compression. Between bzip2 0.9x and bzip2 1.x, there was a subtle change in
the compressor resulting in a slightly better compression result. The end
result is a different file, and thus a different md5sum. Assuming compressor
versions are the same, there is also the issue of what compression level the
target source was originally compressed at: was it compressed with -9, -8 or
-7? That's just a sampling of the various original settings that must be
accounted for, and that's limited to gzip/bzip2; other compressors will add to
the number of variables to be accounted for to produce an exact recreation of
the compressed md5sum.

Tracking the compressor version and options originally used isn't really a
valid option. Even assuming all options were accounted for, clients would
still be required to have multiple versions of the same compressor installed
just for the sake of recreating a compressed md5sum *even though* the
uncompressed source's md5 has already been verified.
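
The compression-level half of the problem is easy to reproduce with Python's
standard ``bz2`` module. The sketch below compresses the same made-up data at
levels 7, 8 and 9: every compressed md5 differs, while the uncompressed md5 is
identical. (It cannot show the bzip2 0.9x versus 1.x difference, which would
need both compressor versions installed.)

```python
import bz2
import hashlib

# Made-up stand-in for an uncompressed source tarball.
source = b"example source tarball contents\n" * 1000

by_level = {level: bz2.compress(source, level) for level in (7, 8, 9)}

compressed_md5s = {hashlib.md5(c).hexdigest() for c in by_level.values()}
uncompressed_md5s = {
    hashlib.md5(bz2.decompress(c)).hexdigest() for c in by_level.values()
}

print(len(compressed_md5s), "distinct compressed md5sums")
print(len(uncompressed_md5s), "distinct uncompressed md5sum")
```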

=====================
The Proposed Solution
=====================

The creation of a client-side flatfile/db of valid alternate md5/size pairs
would enable portage to handle perfectly reconstructed files that have a
different md5sum due to compression differences. The proposed format is thus:

::

    MD5 md5sum orig-file size MD5 md5sum [ optional new-name ] size

Example:

::

    MD5 984146931906a7d53300b29f58f6a899 OOo_1.0.3_source.tar.bz2 165475319 MD5 0733dd85ed44d88d1eabed704d579721 165444187

An alternate md5/size pair for a file would be added **only** when the
uncompressed source's md5/size has been verified, yet upon recompression the
md5 differs. For cleansing of older md5/size pairs from this db, a utility
would be required. The author suggests the addition of a distfiles-cleaning
utility to portage, with the ability to also cleanse old md5/size pairs when
the file the pair was created for no longer exists in distfiles.
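
A small Python sketch of reading such an entry and consulting it follows.
Both helper names are hypothetical, and the sketch assumes the optional
new-name, when present, is a single whitespace-free field.

```python
def parse_alternate(line):
    """Parse one alternate md5 db line (hypothetical helper).

    Returns (orig_name, (orig_md5, orig_size), new_name, (alt_md5, alt_size)).
    new_name falls back to orig_name when the optional field is absent.
    """
    fields = line.split()
    if len(fields) not in (7, 8) or fields[0] != "MD5" or fields[4] != "MD5":
        raise ValueError("malformed alternate md5 entry")
    orig_md5, orig_name, orig_size = fields[1], fields[2], int(fields[3])
    alt_md5, alt_size = fields[5], int(fields[-1])
    new_name = fields[6] if len(fields) == 8 else orig_name
    return orig_name, (orig_md5, orig_size), new_name, (alt_md5, alt_size)


def is_valid_distfile(md5, size, tree_pair, alternates):
    """Accept the tree's md5/size pair or any recorded alternate pair."""
    return (md5, size) == tree_pair or (md5, size) in alternates


line = (
    "MD5 984146931906a7d53300b29f58f6a899 OOo_1.0.3_source.tar.bz2 165475319 "
    "MD5 0733dd85ed44d88d1eabed704d579721 165444187"
)
orig_name, tree_pair, new_name, alt_pair = parse_alternate(line)
print(new_name, is_valid_distfile(*alt_pair, tree_pair, {alt_pair}))
```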

Where to store the database is debatable; /etc/portage and /var/cache/edb are
definite options.

The reasoning for allowing an optional new-name is that it provides needed
functionality should anyone attempt to extend portage to allow clients to
change the compression used for a source (e.g., recompress all gzip files as
bzip2). Granted, no such code or attempt has been made, but nothing is lost
by leaving the option open should the request/attempt be made.

A potential gotcha of adding this support is that in environments where the
distfiles directory is shared out to multiple systems, this db must be shared
also.



Distfile Mirror Additions
-------------------------

One issue of contention is where these files will actually be stored. As of
the writing of this GLEP, a full distfiles mirror is roughly 40 GB; a rough
estimate by the author places the space requirements for patches for each
version at a total of around 4 GB. Note this isn't even remotely a hard
figure yet, and a better figure is currently being looked into.

Regardless of the exact space figure, finding a place to store the patches
will be problematic. Expansion of the required mirror space (essentially just
swallowing the patches' storage requirement) is unlikely, since it was one of
the main arguments against the now defunct glep9 attempt [3]_. A couple of
ideas that have been put forth to handle the additional space requirements are
as follows:

1) Identification of mirrors willing to handle the extra space requirements,
   essentially creating an additional patch mirror tier.

2) Mirroring only a patch for certain package versions, rather than the full
   source. Using kdelibs-3.1.5 as an example, only the patch would be mirrored
   (rather than the full 10.53 MB source). The downside to this approach is
   that a user who is downloading kdelibs for the first time would either need
   to pull it from the original SRC_URI (placing the burden onto the upstream
   mirror), or pull the 3.1.4 version and the patch, pulling 63k more than if
   they had just pulled the full version. The kdelibs 3.1.4/3.1.5 example is
   something of an optimal case; not all versions will have such minuscule
   patches.

3) A variation on the idea above: mirror only the patch for the oldest
   version(s) of a package. E.g., kdelibs currently has versions 3.05, 3.1.5,
   3.2.0, and 3.2.1; the mirrors would carry only a patch for 3.05, not the
   full source (think RESTRICT="fetch"). One plus to this is that patches to
   downgrade in version are smaller than the patches to upgrade in version;
   there are exceptions to this, but they're hard to find. The major downsides
   to this approach are that A) a user would have to sync up to get the
   patchlists for that version, and B) a set of patches to go backwards in
   version would have to be created (see `Binary patches vs GNUDiff patches`_).

Of the options listed above, the first is the easiest, although the second
could be made to work. Feedback and any possible alternatives would be
greatly appreciated.

Patch Creation
--------------

Maintenance of patch lists and the actual patch creation ought to be managed
by a high level script. Essentially, a dev says "I want a patch between this
version and that version: make it so", and the script churns away,
creating/updating the patch list and generating the patch locally. The
utility next uploads the new patch to /space/distfiles-local on dev.gentoo.org
(unless the patch was not generated locally), and repoman is used to commit
the updated patch list.

What would be preferable (although possibly wishful thinking) is if hardware
could be co-opted for automatic patch generation, rather than forcing it upon
the devs, something akin to how files are pulled onto the mirror automatically
for new ebuilds.

The initial bulk of patches will be generated by the author, to ease the
transition and offer patches for people to test out.

Backwards Compatibility
=======================

As noted in `The Proposed Solution`_, a system using patching and sharing out
its distfiles must share out its alternate md5 db. Any system that uses the
distfiles share must support the alternate md5 db also. If this is considered
enough of an issue, it is conceivable to place reconstructed sources with an
alternate md5 into a subdirectory of distdir, since portage only looks within
distdir and is unwilling to descend into subdirectories.

Also note that `Distfile Mirror Additions`_ may add additional backwards
compatibility issues, depending on what solution is accepted.

Reference Implementation
========================

TODO

References
==========

.. [1] http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.{patch,diff}.bz2
.. [2] kdelibs-3.1.4-3.1.5.patch.bz2, switching format patch, created via
   diffball-0.4_pre4 (diffball is available at
   http://sourceforge.net/projects/diffball). Bzip2 -9 compressed, the patch is
   75,687 bytes; uncompressed it is 337,649 bytes. The patch is available at
   http://dev.gentoo.org/~ferringb/kdelibs-3.1.4-3.1.5.patch.bz2 for those
   curious.
.. [3] Glep9, 'Gentoo Package Update System'
   (http://glep.gentoo.org/glep-0009.html)

Copyright
=========

This document has been placed in the public domain.
