Contents of /xml/htdocs/proj/en/glep/glep-0025.html

Revision 1.3 - (show annotations) (download) (as text)
Fri Apr 1 01:32:29 2005 UTC (9 years, 4 months ago) by g2boojum
Branch: MAIN
Changes since 1.2: +14 -15 lines
File MIME type: text/html
fixo

1 <?xml version="1.0" encoding="utf-8" ?>
2 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
3 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
4 <!--
5 This HTML is auto-generated. DO NOT EDIT THIS FILE! If you are writing a new
6 PEP, see http://www.python.org/peps/pep-0001.html for instructions and links
7 to templates. DO NOT USE THIS HTML FILE AS YOUR TEMPLATE!
8 -->
9 <head>
10 <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
11 <meta name="generator" content="Docutils 0.3.7: http://docutils.sourceforge.net/" />
12 <title>GLEP 25 -- Distfile Patching Support</title>
13 <link rel="stylesheet" href="tools/glep.css" type="text/css" />
14 </head>
15 <body bgcolor="white">
16 <table class="navigation" cellpadding="0" cellspacing="0"
17 width="100%" border="0">
18 <tr><td class="navicon" width="150" height="35">
19 <a href="http://www.gentoo.org/" title="Gentoo Linux Home Page">
20 <img src="http://www.gentoo.org/images/gentoo-new.gif" alt="[Gentoo]"
21 border="0" width="150" height="35" /></a></td>
22 <td class="textlinks" align="left">
23 [<b><a href="http://www.gentoo.org/">Gentoo Linux Home</a></b>]
24 [<b><a href="http://www.gentoo.org/proj/en/glep">GLEP Index</a></b>]
25 [<b><a href="./glep-0025.txt">GLEP Source</a></b>]
26 </td></tr></table>
27 <table class="rfc2822 docutils field-list" frame="void" rules="none">
28 <col class="field-name" />
29 <col class="field-body" />
30 <tbody valign="top">
31 <tr class="field"><th class="field-name">GLEP:</th><td class="field-body">25</td>
32 </tr>
33 <tr class="field"><th class="field-name">Title:</th><td class="field-body">Distfile Patching Support</td>
34 </tr>
35 <tr class="field"><th class="field-name">Version:</th><td class="field-body">1.3</td>
36 </tr>
37 <tr class="field"><th class="field-name">Last-Modified:</th><td class="field-body"><a class="reference" href="http://www.gentoo.org/cgi-bin/viewcvs/xml/htdocs/proj/en/glep/glep-0025.txt?cvsroot=gentoo">2005/04/01 01:32:19</a></td>
38 </tr>
39 <tr class="field"><th class="field-name">Author:</th><td class="field-body">Brian Harring &lt;ferringb&#32;&#97;t&#32;gentoo.org&gt;</td>
40 </tr>
41 <tr class="field"><th class="field-name">Status:</th><td class="field-body">deferred</td>
42 </tr>
43 <tr class="field"><th class="field-name">Type:</th><td class="field-body">Standards Track</td>
44 </tr>
45 <tr class="field"><th class="field-name">Content-Type:</th><td class="field-body"><a class="reference" href="glep-0012.html">text/x-rst</a></td>
46 </tr>
47 <tr class="field"><th class="field-name">Created:</th><td class="field-body">6-Mar-2004</td>
48 </tr>
49 <tr class="field"><th class="field-name">Post-History:</th><td class="field-body">4-Apr-2004, 11-Nov-2004</td>
50 </tr>
51 </tbody>
52 </table>
53 <hr />
54 <div class="contents topic" id="contents">
55 <p class="topic-title first"><a name="contents">Contents</a></p>
56 <ul class="simple">
57 <li><a class="reference" href="#abstract" id="id7" name="id7">Abstract</a></li>
58 <li><a class="reference" href="#status" id="id8" name="id8">Status</a></li>
59 <li><a class="reference" href="#motivation" id="id9" name="id9">Motivation</a></li>
60 <li><a class="reference" href="#binary-patches-vs-gnudiff-patches" id="id10" name="id10">Binary patches vs GNUDiff patches</a></li>
61 <li><a class="reference" href="#rationale" id="id11" name="id11">Rationale</a></li>
62 <li><a class="reference" href="#specification" id="id12" name="id12">Specification</a><ul>
63 <li><a class="reference" href="#additions-to-the-tree" id="id13" name="id13">Additions to the tree</a></li>
64 <li><a class="reference" href="#portage-implementation" id="id14" name="id14">Portage Implementation</a><ul>
65 <li><a class="reference" href="#fetching" id="id15" name="id15">Fetching</a></li>
66 <li><a class="reference" href="#reconstruction" id="id16" name="id16">Reconstruction</a></li>
67 <li><a class="reference" href="#compressed-md5sums" id="id17" name="id17">Compressed MD5sums:</a><ul>
68 <li><a class="reference" href="#the-problem-in-detail" id="id18" name="id18">The Problem in Detail</a></li>
69 <li><a class="reference" href="#the-proposed-solution" id="id19" name="id19">The Proposed Solution</a></li>
70 </ul>
71 </li>
72 </ul>
73 </li>
74 <li><a class="reference" href="#distfile-mirror-additions" id="id20" name="id20">Distfile Mirror Additions</a></li>
75 <li><a class="reference" href="#patch-creation" id="id21" name="id21">Patch Creation</a></li>
76 </ul>
77 </li>
78 <li><a class="reference" href="#backwards-compatibility" id="id22" name="id22">Backwards Compatibility</a></li>
79 <li><a class="reference" href="#reference-implementation" id="id23" name="id23">Reference Implementation</a></li>
80 <li><a class="reference" href="#references" id="id24" name="id24">References</a></li>
81 <li><a class="reference" href="#copyright" id="id25" name="id25">Copyright</a></li>
82 </ul>
83 </div>
84 <div class="section" id="abstract">
85 <h1><a class="toc-backref" href="#id7" name="abstract">Abstract</a></h1>
86 <p>This GLEP proposes adding distfile patching support to portage, and works
87 out the implementation details.</p>
88 </div>
89 <div class="section" id="status">
90 <h1><a class="toc-backref" href="#id8" name="status">Status</a></h1>
91 <p>Timed out</p>
92 </div>
93 <div class="section" id="motivation">
94 <h1><a class="toc-backref" href="#id9" name="motivation">Motivation</a></h1>
95 <p>Reduce the bandwidth load placed on our mirrors by decreasing the number of
96 bytes transferred when upgrading between versions. A side benefit is a
97 significant decrease in download requirements for users lacking broadband.</p>
98 </div>
99 <div class="section" id="binary-patches-vs-gnudiff-patches">
100 <h1><a class="toc-backref" href="#id10" name="binary-patches-vs-gnudiff-patches">Binary patches vs GNUDiff patches</a></h1>
101 <p>Most people are familiar with diff patches (unified diff, for example). This
102 glep specifically proposes the use of an actual binary differencer. The
103 reason is that diff patches are line based: change a single character in a
104 line, and the whole line must be included in the patch. Binary differencers
105 work at the byte level, so the patch encodes just that byte. In that
106 respect binary patches are often much more efficient than diff patches.</p>
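<p>The difference can be illustrated with a minimal sketch. The sample line of C
and the toy delta format below are made up for illustration; real binary
differencers use far more compact encodings than a list of offset/byte pairs.</p>
<pre class="literal-block">
```python
import difflib

old = "int limit = 100; /* maximum number of entries */\n"
new = "int limit = 200; /* maximum number of entries */\n"

# A unified diff is line based: the old and the new line both appear in full.
diff_text = "".join(difflib.unified_diff([old], [new], "a.c", "b.c"))

# A byte-level delta only needs to record the bytes that changed.
# Toy format (an assumption, not any real tool's): (offset, new_byte) pairs.
byte_delta = [
    (offset, new_byte)
    for offset, (old_byte, new_byte) in enumerate(zip(old.encode(), new.encode()))
    if old_byte != new_byte
]

print(diff_text)
print(byte_delta)
```
</pre>
<p>Here the unified diff carries two full copies of the line plus headers, while
the toy delta holds a single offset/byte pair.</p>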
107 <p>Further, a unified patch can be reversed because the diff
108 includes <strong>both</strong> the original line and the modified line. The author isn't
109 aware of any binary differencer that is able to create patches that can be
110 reversed. They are unidirectional: a generated patch can be used
111 either to upgrade or to downgrade the version, not both. The plus side of
112 this limitation is a significantly decreased patch size.</p>
113 <p>The choice of binary patches over diff patches pretty much comes down to the
114 fact they're smaller. For example, a kdelibs binary patch for 3.1.4-&gt;3.1.5 is
115 75kb, while the equivalent diff patch is 123kb and is unable to reproduce a
116 correct md5 <a class="footnote-reference" href="#id4" id="id1" name="id1">[1]</a>.</p>
117 <p>Currently, this glep proposes only the use of binary patches. That's not
118 to say it couldn't be extended (with a fair amount of work) to support
119 standard diffs.</p>
120 </div>
121 <div class="section" id="rationale">
122 <h1><a class="toc-backref" href="#id11" name="rationale">Rationale</a></h1>
123 <p>The difference between source releases typically isn't very large, especially
124 for minor releases. As an example, kdelibs-3.1.4.tar.bz2 is 10.53 MB, and
125 kdelibs-3.1.5.tar.bz2 is 10.54 MB. A bzip2'ed patch between those versions is
126 75.6 kb <a class="footnote-reference" href="#id5" id="id2" name="id2">[2]</a>, less than 1% the size of 3.1.5's tbz2.</p>
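<p>As a quick check of that figure, the ratio can be computed directly. The patch
size in bytes comes from the footnote; reading "10.54 MB" as 10.54 * 10**6 bytes
is an assumption, since the exact size of the 3.1.5 tarball is not given.</p>
<pre class="literal-block">
```python
patch_bytes = 75687        # bzip2'ed patch size, from the footnote
target_bytes = 10.54e6     # assumption: "10.54 MB" read as 10.54 * 10**6 bytes

ratio_percent = 100 * patch_bytes / target_bytes
print(f"patch is {ratio_percent:.2f}% of the full download")
```
</pre>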
127 </div>
128 <div class="section" id="specification">
129 <h1><a class="toc-backref" href="#id12" name="specification">Specification</a></h1>
130 <p>Several parts of Gentoo are affected: mirroring, the portage tree, and
131 portage itself.</p>
132 <div class="section" id="additions-to-the-tree">
133 <h2><a class="toc-backref" href="#id13" name="additions-to-the-tree">Additions to the tree</a></h2>
134 <p>For adding patch info into the tree, this glep proposes a global patch list
135 (stored in profiles as patches.global), and individual patch lists stored in
136 relevant package directories (named patches). Using the kernel packages as an
137 example, a global list of patches enables us to create a patch once, add an
138 entry, and have all kernel packages benefit from that single entry. Both
139 patches.global, and individual package patch files share the same format:</p>
140 <pre class="literal-block">
141 MD5 md5-value patch-url size MD5 md5-value ref-file size UMD5 md5-value new-file size
142 </pre>
143 <p>For those familiar with digest file layout, this should look familiar:
144 essentially checksum type, value, filename, size. The UMD5 checksum type is just
145 the md5/size of the uncompressed file, so if the UMD5 were for a bzip2
146 compressed file, it would be the md5 value/size of the uncompressed file.
147 An example:</p>
148 <pre class="literal-block">
149 MD5 ccd5411b3558326cbce0306fcae32e26 http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.patch.bz2 75687 MD5 82c265de78d53c7060a09c5cb1a78942 kdelibs-3.1.4.tar.bz2 10537433 UMD5 0b1908a51e739c07ff5a88e189d2f7a9 kdelibs-3.1.5.tar.bz2 48056320
150 </pre>
151 <p>In the above example, the md5sum of
152 <a class="reference" href="http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.patch.bz2">http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.patch.bz2</a> is
153 calculated, compared to the stored value, and then the file size is checked.
154 The one difference is the UMD5 checksum type- the md5 value and the size are
155 specific to the <em>uncompressed</em> file. Continuing, for cases where the patch
156 will reside on one of our mirrors, the patch filename would be sufficient.</p>
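<p>A patch list entry in this format could be parsed roughly as follows. This is
a sketch only: the names <tt>Entry</tt> and <tt>parse_patch_line</tt> are
hypothetical, and it assumes fields are separated by whitespace and that file
names and URLs contain none.</p>
<pre class="literal-block">
```python
from collections import namedtuple

# One (checksum type, checksum, name, size) group from a patch list line.
Entry = namedtuple("Entry", "chksum_type chksum name size")

def parse_patch_line(line):
    """Split one patch list line into (patch, ref_file, new_file) entries."""
    fields = line.split()
    if len(fields) != 12:
        raise ValueError("expected 12 whitespace-separated fields")
    groups = [fields[i:i + 4] for i in range(0, 12, 4)]
    return tuple(Entry(t, md5, name, int(size)) for t, md5, name, size in groups)

line = (
    "MD5 ccd5411b3558326cbce0306fcae32e26 "
    "http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.patch.bz2 75687 "
    "MD5 82c265de78d53c7060a09c5cb1a78942 kdelibs-3.1.4.tar.bz2 10537433 "
    "UMD5 0b1908a51e739c07ff5a88e189d2f7a9 kdelibs-3.1.5.tar.bz2 48056320"
)
patch, ref_file, new_file = parse_patch_line(line)
print(patch.name, ref_file.name, new_file.chksum_type, new_file.size)
```
</pre>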
157 <p>Finally, note that this is a unidirectional patch- using the above example,
158 kdelibs-3.1.4-3.1.5 can <strong>only</strong> be used to upgrade from 3.1.4 to 3.1.5, not
159 in reverse (originally explained in <a class="reference" href="#binary-patches-vs-gnudiff-patches">Binary patches vs GNUDiff patches</a>).</p>
160 </div>
161 <div class="section" id="portage-implementation">
162 <h2><a class="toc-backref" href="#id14" name="portage-implementation">Portage Implementation</a></h2>
163 <p>This glep proposes the patching support should be (at this stage) optional-
164 specifically, enabled via FEATURES=&quot;patching&quot;.</p>
165 <div class="section" id="fetching">
166 <h3><a class="toc-backref" href="#id15" name="fetching">Fetching</a></h3>
167 <p>When patching is enabled, the global patch list is read, and the package's
168 patch list is read. From there, portage determines what files could be used
169 as a base for patching to the desired file, and further determines whether it's
170 actually worth patching (it isn't when the target file is
171 smaller than the sum of the patches needed). Any patches to be used are fetched,
172 and md5 verified.</p>
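<p>The decision described above might be sketched as a cheapest-path search over
the patch lists. Everything here is an assumption for illustration:
<tt>plan_fetch</tt> is a hypothetical name, patches are given as plain tuples,
and "worth patching" is taken to mean only that the patches total fewer bytes
than the full file.</p>
<pre class="literal-block">
```python
import heapq

def plan_fetch(target, target_size, on_hand, patches):
    """Return ("patch", [patch names]) or ("full", []) for fetching target.

    patches: list of (patch_name, patch_size, ref_file, new_file) tuples.
    on_hand: names of files already present in distfiles.
    """
    # Dijkstra over files; each patch is an edge weighted by its size in bytes.
    queue = [(0, name, []) for name in sorted(on_hand)]
    heapq.heapify(queue)
    seen = set()
    while queue:
        cost, name, chain = heapq.heappop(queue)
        if name in seen:
            continue
        seen.add(name)
        if name == target:
            # Patch only if that moves fewer bytes; a tie goes to the full fetch.
            options = [(target_size, "full", []), (cost, "patch", chain)]
            _, kind, names = min(options, key=lambda option: option[0])
            return kind, names
        for patch_name, size, ref, new in patches:
            if ref == name:
                heapq.heappush(queue, (cost + size, new, chain + [patch_name]))
    return "full", []  # no chain of patches leads to the target

patches = [("kdelibs-3.1.4-3.1.5.patch.bz2", 75687,
            "kdelibs-3.1.4.tar.bz2", "kdelibs-3.1.5.tar.bz2")]
print(plan_fetch("kdelibs-3.1.5.tar.bz2", 10540000,
                 {"kdelibs-3.1.4.tar.bz2"}, patches))
```
</pre>
<p>The 10540000 byte target size in the usage line is a stand-in for the
"10.54 MB" figure, not an exact value.</p>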
173 </div>
174 <div class="section" id="reconstruction">
175 <h3><a class="toc-backref" href="#id16" name="reconstruction">Reconstruction</a></h3>
176 <p>Upon fetching and md5 verification of patch(es), the desired file is
177 reconstructed. Assuming reconstruction didn't return any errors, the target
178 file has its uncompressed md5sum calculated and verified, then is recompressed
179 and the compressed md5sum calculated. At this point, if the compressed md5
180 matches the md5 stored in the tree, then portage transfers the file into
181 distfiles, and continues on its merry way.</p>
182 <p>If the compressed md5 is different from the tree's value, then the (proposed)
183 md5 database is updated with the new compressed md5. Details of this database
184 (and the issue it addresses) follow.</p>
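<p>The verification steps above could look roughly like this. It is a sketch
under assumptions: the function names are hypothetical, the target is taken to
be bzip2 compressed at level 9, and the whole file is held in memory for
brevity.</p>
<pre class="literal-block">
```python
import bz2
import hashlib

def md5_and_size(data):
    """Return the (md5 hex digest, size in bytes) pair for data."""
    return hashlib.md5(data).hexdigest(), len(data)

def verify_reconstructed(raw, umd5, usize, tree_md5, tree_size):
    """Check a reconstructed, still uncompressed distfile.

    Returns (compressed_bytes, alternate). alternate is None when the
    recompressed file matches the tree, otherwise it is the (md5, size)
    pair that would be recorded in the alternate md5 database.
    """
    # Step 1: the uncompressed data must match the UMD5 entry exactly.
    if md5_and_size(raw) != (umd5, usize):
        raise ValueError("reconstruction failed: uncompressed md5/size mismatch")
    # Step 2: recompress and compare against the tree's compressed md5/size.
    packed = bz2.compress(raw, 9)
    pair = md5_and_size(packed)
    if pair == (tree_md5, tree_size):
        return packed, None
    return packed, pair
```
</pre>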
185 </div>
186 <div class="section" id="compressed-md5sums">
187 <h3><a class="toc-backref" href="#id17" name="compressed-md5sums">Compressed MD5sums:</a></h3>
188 <p>There will be instances where a file is reconstructed perfectly, recompressed,
189 and the recompressed md5sum differs from what is stored in the tree- the
190 problem is that the md5sum of a compressed file is inherently tied to the
191 compressor version/options used to compress the original source.</p>
192 <div class="section" id="the-problem-in-detail">
193 <h4><a class="toc-backref" href="#id18" name="the-problem-in-detail">The Problem in Detail</a></h4>
194 <p>A good example of this problem is related to bzip2 versions used for
195 compression. Between bzip2 0.9x and bzip2 1.x, there was a subtle change in
196 the compressor resulting in a slightly better compression result. The end result
197 is a different file, i.e. a different md5sum. Even assuming compressor versions
198 are the same, there is also the issue of what compression level the target
199 source was originally compressed at: was it compressed with -9, -8 or -7?
200 That's just a sampling of the various original settings that must be accounted
201 for, and that's limited to gzip/bzip2; other compressors will add to the
202 number of variables to be accounted for to produce an exact recreation of the
203 compressed md5sum.</p>
204 <p>Tracking the compressor version and options originally used isn't really a
205 valid option- assuming all options were accounted for, clients would still be
206 required to have multiple versions of the same compressor installed just for
207 the sake of recreating a compressed md5sum <em>even though</em> the uncompressed
208 source's md5 has already been verified.</p>
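<p>The compression level half of the problem is easy to reproduce. The sketch
below uses made-up sample data and Python's <tt>bz2</tt> module; it shows only
the effect of the level, not of differing compressor versions.</p>
<pre class="literal-block">
```python
import bz2
import hashlib

raw = b"the same uncompressed source\n" * 5000

packed_9 = bz2.compress(raw, 9)   # as if upstream had used bzip2 -9
packed_1 = bz2.compress(raw, 1)   # as if a client recompressed with bzip2 -1

# The compressed streams differ, so their md5sums differ...
md5_9 = hashlib.md5(packed_9).hexdigest()
md5_1 = hashlib.md5(packed_1).hexdigest()

# ...while both decompress to identical bytes with one uncompressed md5.
umd5_9 = hashlib.md5(bz2.decompress(packed_9)).hexdigest()
umd5_1 = hashlib.md5(bz2.decompress(packed_1)).hexdigest()

print(md5_9 == md5_1, umd5_9 == umd5_1)
```
</pre>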
209 </div>
210 <div class="section" id="the-proposed-solution">
211 <h4><a class="toc-backref" href="#id19" name="the-proposed-solution">The Proposed Solution</a></h4>
212 <p>The creation of a clientside flatfile/db of valid alternate md5/size pairs
213 would enable portage to handle perfectly reconstructed files, that have a
214 different md5sum due to compression differences. The proposed format is thus:</p>
215 <pre class="literal-block">
216 MD5 md5sum orig-file size MD5 md5sum [ optional new-name ] size
217 </pre>
218 <p>Example:</p>
219 <pre class="literal-block">
220 MD5 984146931906a7d53300b29f58f6a899 OOo_1.0.3_source.tar.bz2 165475319 MD5 0733dd85ed44d88d1eabed704d579721 165444187
221 </pre>
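<p>An entry in this format, with or without the optional new-name, could be read
as sketched below. The function name and the returned dictionaries are
hypothetical, and the sketch assumes whitespace-separated fields and file names
without whitespace.</p>
<pre class="literal-block">
```python
def parse_alt_md5_line(line):
    """Parse one alternate md5 entry into (orig, alt) dictionaries.

    When the optional new-name is omitted, the alternate entry keeps
    the original file name.
    """
    fields = line.split()
    if len(fields) not in (7, 8) or fields[0] != "MD5" or fields[4] != "MD5":
        raise ValueError("malformed alternate md5 entry")
    orig = {"md5": fields[1], "name": fields[2], "size": int(fields[3])}
    new_name = fields[6] if len(fields) == 8 else orig["name"]
    alt = {"md5": fields[5], "name": new_name, "size": int(fields[-1])}
    return orig, alt

orig, alt = parse_alt_md5_line(
    "MD5 984146931906a7d53300b29f58f6a899 OOo_1.0.3_source.tar.bz2 165475319 "
    "MD5 0733dd85ed44d88d1eabed704d579721 165444187"
)
print(orig["size"], alt["name"], alt["size"])
```
</pre>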
222 <p>An alternate md5/size pair for a file would be added <strong>only</strong> when the
223 uncompressed source's md5/size has been verified, yet upon recompression the
224 md5 differs. For cleansing of older md5/size pairs from this db, a utility
225 would be required- the author suggests the addition of a distfiles-cleaning
226 utility to portage, with the ability to also cleanse old md5/size pairs when
227 the file the pair was created for no longer exists in distfiles.</p>
228 <p>Where to store the database is debatable- /etc/portage or /var/cache/edb are
229 definite options.</p>
230 <p>The reasoning for allowing for an optional new-name is that it provides needed
231 functionality should anyone attempt to extend portage to allow for clients to
232 change the compression used for a source (eg, recompress all gzip files as
233 bzip2). Granted, no such code or attempt has been made, but nothing is lost
234 by leaving the option open should the request/attempt be made.</p>
235 <p>A potential gotcha of adding this support is that in environments where the
236 distfiles directory is shared out to multiple systems, this db must be shared
237 also.</p>
238 </div>
239 </div>
240 </div>
241 <div class="section" id="distfile-mirror-additions">
242 <h2><a class="toc-backref" href="#id20" name="distfile-mirror-additions">Distfile Mirror Additions</a></h2>
243 <p>One issue of contention is where these files will actually be stored. As of
244 the writing of this glep, a full distfiles mirror is roughly 40 GB. A
245 rough estimate by the author places the space requirements for patches for
246 each version at a total of around 4 GB. Note this isn't even remotely a hard
247 figure yet, and a better estimate is currently being worked out.</p>
248 <p>Regardless of the exact space figure, finding a place to store the patches
249 will be problematic. Expansion of the required mirror space (essentially just
250 swallowing the patches storage requirement) is unlikely, since it was one of
251 the main arguments against the now defunct glep9 attempt <a class="footnote-reference" href="#id6" id="id3" name="id3">[3]</a>. A couple of
252 ideas that have been put forth to handle the additional space requirements are
253 as follows:</p>
254 <p>1) Identification of mirrors willing to handle the extra space requirements,
255 essentially creating an additional patch mirror tier.</p>
256 <p>2) Mirroring only a patch for certain package versions, rather than full
257 source. Using kdelibs-3.1.5 as an example, only the patch would be mirrored
258 (rather than the full 10.54 MB source). The downside to this approach is that a
259 user who is downloading kdelibs for the first time would either need to pull
260 it from the original SRC_URI (placing the burden onto the upstream mirror), or
261 pull the 3.1.4 version and the patch, pulling 63k more than if they had just
262 pulled the full version. The kdelibs 3.1.4/3.1.5 example is something of an
263 optimal case: not all versions will have such minuscule patches.</p>
264 <p>3) A variation on the idea above, essentially mirroring only the patch for
265 the oldest version(s) of a package; eg, kdelibs currently has version 3.05,
266 3.1.5, 3.2.0, and 3.2.1: the mirrors would only carry a patch for 3.05, not
267 full source (think RESTRICT=&quot;fetch&quot;). One plus to this is that patches to
268 downgrade in version are smaller than the patches to upgrade in version. There
269 are exceptions to this, but they're hard to find. The major downsides to this
270 approach are A) a user would have to sync up to get the patchlists for that
271 version, and B) a set of patches to go backwards in version would have to be created (see
272 <a class="reference" href="#binary-patches-vs-gnudiff-patches">Binary patches vs GNUDiff patches</a>).</p>
273 <p>Of the options listed above, the first is the easiest, although the second
274 could be made to work. Feedback and any possible alternatives would be
275 greatly appreciated.</p>
276 </div>
277 <div class="section" id="patch-creation">
278 <h2><a class="toc-backref" href="#id21" name="patch-creation">Patch Creation</a></h2>
279 <p>Maintenance of patch lists, and the actual patch creation, ought to be managed
280 by a high level script. Essentially a dev says &quot;I want a patch between this
281 version, and that version: make it so&quot;, and the script churns away
282 creating/updating the patch list, and generating the patch locally. The
283 utility next uploads the new patch to /space/distfiles-local on dev.gentoo.org
284 (unless it's not a locally generated patch), and repoman is used to
285 commit the updated patch list.</p>
286 <p>What would be preferable (although possibly wishful thinking) is if hardware
287 could be co-opted for automatic patch generation, rather than forcing it upon
288 the devs, something akin to how files are pulled onto the mirror automatically
289 for new ebuilds.</p>
290 <p>The initial bulk of patches will be generated by the author, to ease
291 the transition and offer patches for people to test out.</p>
292 </div>
293 </div>
294 <div class="section" id="backwards-compatibility">
295 <h1><a class="toc-backref" href="#id22" name="backwards-compatibility">Backwards Compatibility</a></h1>
296 <p>As noted in <a class="reference" href="#the-proposed-solution">The Proposed Solution</a>, a system using patching and sharing out
297 its distfiles must share out its alternate md5 db. Any system that uses the
298 distfiles share must support the alternate md5 db also. If this is considered
299 enough of an issue, it is conceivable to place reconstructed sources with an
300 alternate md5 into a subdirectory of distdir, since portage only looks within
301 distdir and does not descend into subdirectories.</p>
302 <p>Also note that <a class="reference" href="#distfile-mirror-additions">Distfile Mirror Additions</a> may introduce further backwards
303 compatibility issues, depending on what solution is accepted.</p>
304 </div>
305 <div class="section" id="reference-implementation">
306 <h1><a class="toc-backref" href="#id23" name="reference-implementation">Reference Implementation</a></h1>
307 <p>TODO</p>
308 </div>
309 <div class="section" id="references">
310 <h1><a class="toc-backref" href="#id24" name="references">References</a></h1>
311 <table class="docutils footnote" frame="void" id="id4" rules="none">
312 <colgroup><col class="label" /><col /></colgroup>
313 <tbody valign="top">
314 <tr><td class="label"><a class="fn-backref" href="#id1" name="id4">[1]</a></td><td><a class="reference" href="http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5">http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5</a>.{patch,diff}.bz2.</td></tr>
315 </tbody>
316 </table>
317 <table class="docutils footnote" frame="void" id="id5" rules="none">
318 <colgroup><col class="label" /><col /></colgroup>
319 <tbody valign="top">
320 <tr><td class="label"><a name="id5">[2]</a></td><td><em>(<a class="fn-backref" href="#id2">1</a>, <a class="fn-backref" href="#id3">2</a>)</em> kdelibs-3.1.4-3.1.5.patch.bz2, switching format patch, created via diffball-0.4_pre4 (diffball is available at <a class="reference" href="http://sourceforge.net/projects/diffball">http://sourceforge.net/projects/diffball</a>)
321 Bzip2 -9 compressed, the patch is 75,687 bytes, uncompressed it is 337,649 bytes. The patch is available at <a class="reference" href="http://dev.gentoo.org/~ferringb/kdelibs-3.1.4-3.1.5.patch.bz2">http://dev.gentoo.org/~ferringb/kdelibs-3.1.4-3.1.5.patch.bz2</a> for those curious.</td></tr>
322 </tbody>
323 </table>
324 <table class="docutils footnote" frame="void" id="id6" rules="none">
325 <colgroup><col class="label" /><col /></colgroup>
326 <tbody valign="top">
327 <tr><td class="label"><a name="id6">[3]</a></td><td>Glep9, 'Gentoo Package Update System'
328 (<a class="reference" href="http://glep.gentoo.org/glep-0009.html">http://glep.gentoo.org/glep-0009.html</a>)</td></tr>
329 </tbody>
330 </table>
331 </div>
332 <div class="section" id="copyright">
333 <h1><a class="toc-backref" href="#id25" name="copyright">Copyright</a></h1>
334 <p>This document has been placed in the public domain.</p>
335 </div>
336
337 </div>
338 <hr class="docutils footer" />
339 <div class="footer">
340 <a class="reference" href="glep-0025.txt">View document source</a>.
341 Generated on: 2005-04-01 01:32 UTC.
342 Generated by <a class="reference" href="http://docutils.sourceforge.net/">Docutils</a> from <a class="reference" href="http://docutils.sourceforge.net/rst.html">reStructuredText</a> source.
343 </div>
344 </body>
345 </html>
