GLEP: 25
Title: Distfile Patching Support
Version: $Revision: 1.0 $
Last-Modified: $Date: 2004/04/01 01:07:40 $
Author: Brian Harring <ferringb@gentoo.org>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 6-Mar-2004
Post-History: 4-Apr-2004

Abstract
========

This GLEP proposes adding patching support for distfiles to portage, and
irons out the implementation details.

Motivation
==========

Reduce the bandwidth load placed on our mirrors by decreasing the number of
bytes transferred when upgrading between versions. A side benefit is a
significantly smaller download for users lacking broadband.

Binary patches vs GNUDiff patches
=================================

Most people are familiar with diff patches (unified diff, for example). This
GLEP specifically proposes the use of an actual binary differencer. The
reason is that diff patches are line based: change a single character in a
line, and the whole line must be included in the patch. A binary differencer
works at the byte level and encodes just the bytes that changed. In that
respect binary patches are often much more efficient than diff patches.
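
As a rough illustration of the line-versus-byte point (not part of the
proposal), the sketch below changes one character in a 200-character line and
compares a unified diff with a toy byte-level delta. The ``byte_delta`` helper
is hypothetical and only handles equal-length inputs; a real binary
differencer uses a far more capable encoding.

```python
import difflib


def unified_diff_size(old: str, new: str) -> int:
    """Size in bytes of a unified diff (no context lines) between two texts."""
    diff = difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile="old",
        tofile="new",
        n=0,
    )
    return len("".join(diff).encode())


def byte_delta(old: bytes, new: bytes) -> list:
    """Toy byte-level delta: (offset, new_byte) pairs for equal-length inputs."""
    if len(old) != len(new):
        raise ValueError("toy delta only handles equal-length inputs")
    return [(i, b) for i, (a, b) in enumerate(zip(old, new)) if a != b]


line = "x" * 200
old = f"{line}\n{line}\n"
# Change a single character in the middle of the second line.
new = f"{line}\n{'x' * 100}y{'x' * 99}\n"

delta = byte_delta(old.encode(), new.encode())
diff_bytes = unified_diff_size(old, new)

print("byte-level delta entries:", len(delta))
print("unified diff size in bytes:", diff_bytes)
```

The unified diff has to carry both the old and the new 200-character line,
while the byte-level delta records a single offset and value.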

Further, a unified patch can be reversed because the diff includes **both**
the original line and the modified line. The author isn't aware of any binary
differencer that is able to create patches that can be reversed. They are
unidirectional: a generated patch can be used either to upgrade or to
downgrade the version, not both. The plus side of this limitation is a
significantly smaller patch.

The choice of binary patches over diff patches comes down to size. For
example, a kdelibs binary patch for 3.1.4->3.1.5 is 75 kB; the equivalent diff
patch is 123 kB, and is unable to result in a correct md5 [1]_.

Currently, this GLEP proposes only the usage of binary patches. That's not to
say it couldn't be extended (with a fair amount of work) to support standard
diffs.

Rationale
=========

The difference between source releases typically isn't very large, especially
for minor releases. As an example, kdelibs-3.1.4.tar.bz2 is 10.53 MB, and
kdelibs-3.1.5.tar.bz2 is 10.54 MB. A bzip2'ed patch between those versions is
75.6 kB [2]_, less than 1% of the size of 3.1.5's tbz2.

Specification
=============

Several parts of Gentoo are affected: mirroring, the portage tree, and
portage itself.

Additions to the tree
---------------------

To add patch info to the tree, this GLEP proposes a global patch list
(stored in profiles as patches.global), and individual patch lists stored in
the relevant package directories (named patches). Using the kernel packages
as an example, a global list of patches enables us to create a patch once, add
an entry, and have all kernel packages benefit from that single entry. Both
patches.global and the individual package patch files share the same format:

::

  MD5 md5-value patch-url size MD5 md5-value ref-file size UMD5 md5-value new-file size

For those familiar with the digest file layout, this should look familiar:
essentially checksum type, value, filename, size. The UMD5 checksum type is
the md5/size of the uncompressed file, so if the UMD5 were for a bzip2
compressed file, it would be the md5 value/size of the uncompressed file.
And an example:

::

  MD5 ccd5411b3558326cbce0306fcae32e26 http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.patch.bz2 75687 MD5 82c265de78d53c7060a09c5cb1a78942 kdelibs-3.1.4.tar.bz2 10537433 UMD5 0b1908a51e739c07ff5a88e189d2f7a9 kdelibs-3.1.5.tar.bz2 48056320

In the above example, the md5sum of
http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.patch.bz2 is
calculated and compared to the stored value, and then the file size is checked.
The one difference is the UMD5 checksum type: the md5 value and the size are
specific to the *uncompressed* file. For cases where the patch will reside on
one of our mirrors, the patch filename alone would be sufficient in place of
the full URL.
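
A minimal sketch of reading one such entry, assuming whitespace-separated
fields and no spaces in file names or URLs. The ``Checksum`` and
``PatchEntry`` types and the ``parse_patch_entry`` function are hypothetical
names chosen for illustration, not existing portage code.

```python
from typing import NamedTuple


class Checksum(NamedTuple):
    kind: str   # "MD5" or "UMD5"
    md5: str
    name: str   # file name, or a full URL for the patch
    size: int


class PatchEntry(NamedTuple):
    patch: Checksum
    ref_file: Checksum
    new_file: Checksum


def parse_patch_entry(line: str) -> PatchEntry:
    """Split one patch list line into its three (type, md5, name, size) groups."""
    fields = line.split()
    if len(fields) != 12:
        raise ValueError(f"expected 12 fields, got {len(fields)}")
    groups = []
    for i in range(0, 12, 4):
        kind, md5, name, size = fields[i:i + 4]
        if kind not in ("MD5", "UMD5"):
            raise ValueError(f"unknown checksum type: {kind}")
        groups.append(Checksum(kind, md5.lower(), name, int(size)))
    return PatchEntry(*groups)


entry = parse_patch_entry(
    "MD5 ccd5411b3558326cbce0306fcae32e26 "
    "http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.patch.bz2 75687 "
    "MD5 82c265de78d53c7060a09c5cb1a78942 kdelibs-3.1.4.tar.bz2 10537433 "
    "UMD5 0b1908a51e739c07ff5a88e189d2f7a9 kdelibs-3.1.5.tar.bz2 48056320"
)
print(entry.ref_file.name, "->", entry.new_file.name, entry.patch.size)
```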

Finally, note that this is a unidirectional patch. Using the above example,
kdelibs-3.1.4-3.1.5 can **only** be used to upgrade from 3.1.4 to 3.1.5, not
in reverse (originally explained in `Binary patches vs GNUDiff patches`_).

Portage Implementation
----------------------

This GLEP proposes that patching support be optional (at this stage),
enabled via FEATURES="patching".

Fetching
''''''''

When patching is enabled, the global patch list and the package's patch list
are read. From there, portage determines which files could be used as a base
for patching to the desired file, and whether patching is actually worthwhile
(it is not when the target file is smaller than the sum of the patches
needed). Any patches to be used are fetched and md5 verified.
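
One way the "is it worth patching" decision could be sketched, assuming the
patch lists have already been reduced to ``(ref_file, new_file, patch_size)``
tuples. The function names are hypothetical, and the shortest-path search is
only one possible strategy; the GLEP does not prescribe how a chain of patches
is chosen.

```python
import heapq


def cheapest_patch_chain(patches, have, target):
    """Return (total_patch_bytes, chain) from any file in `have` to `target`.

    `chain` is a list of (ref_file, new_file) steps; None means no chain exists.
    """
    edges = {}
    for ref, new, size in patches:
        edges.setdefault(ref, []).append((new, size))
    # Dijkstra over files, starting from everything already in distfiles.
    queue = [(0, name, []) for name in sorted(have)]
    heapq.heapify(queue)
    seen = set()
    while queue:
        cost, name, chain = heapq.heappop(queue)
        if name == target:
            return cost, chain
        if name in seen:
            continue
        seen.add(name)
        for new, size in edges.get(name, []):
            if new not in seen:
                heapq.heappush(queue, (cost + size, new, chain + [(name, new)]))
    return None


def should_patch(patches, have, target, target_size):
    """Patch only if a chain exists and costs fewer bytes than the full file."""
    found = cheapest_patch_chain(patches, have, target)
    return found is not None and bool(found[1]) and found[0] < target_size


patches = [("kdelibs-3.1.4.tar.bz2", "kdelibs-3.1.5.tar.bz2", 75687)]
have = {"kdelibs-3.1.4.tar.bz2"}
# 10_540_000 approximates the "10.54 MB" figure quoted for kdelibs-3.1.5.
worth_it = should_patch(patches, have, "kdelibs-3.1.5.tar.bz2", 10_540_000)
no_base = should_patch(patches, set(), "kdelibs-3.1.5.tar.bz2", 10_540_000)
print(worth_it, no_base)
```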

Reconstruction
''''''''''''''

Upon fetching and md5 verification of the patch(es), the desired file is
reconstructed. Assuming reconstruction didn't return any errors, the target
file has its uncompressed md5sum calculated and verified, then is recompressed
and the compressed md5sum calculated. At this point, if the compressed md5
matches the md5 stored in the tree, portage transfers the file into
distfiles and continues on its merry way.

If the compressed md5 differs from the tree's value, the (proposed) md5
database is updated with the new compressed md5. Details of this database
(and the issue it addresses) follow.
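
The verification order described above can be sketched as follows. This is an
illustration under stated assumptions: the reconstructed source fits in
memory, the target is bzip2 compressed, and ``verify_reconstruction`` and its
return values are hypothetical names. Real code would stream the file rather
than hold it in a bytes object.

```python
import bz2
import hashlib


def md5_and_size(data: bytes):
    """Return the (md5 hex digest, size) pair used throughout the patch lists."""
    return hashlib.md5(data).hexdigest(), len(data)


def verify_reconstruction(raw, umd5, usize, tree_md5, tree_size, compresslevel=9):
    """Classify a reconstructed, uncompressed source.

    Returns ("reject", None) if the uncompressed md5/size is wrong,
    ("match", packed) if the recompressed file equals the tree's digest entry,
    ("alternate", packed) if the source is correct but recompresses differently.
    """
    if md5_and_size(raw) != (umd5, usize):
        return "reject", None
    packed = bz2.compress(raw, compresslevel)
    if md5_and_size(packed) == (tree_md5, tree_size):
        return "match", packed
    return "alternate", packed


raw = b"example source tree\n" * 1000
umd5, usize = md5_and_size(raw)
tree_md5, tree_size = md5_and_size(bz2.compress(raw, 9))

same = verify_reconstruction(raw, umd5, usize, tree_md5, tree_size)[0]
other = verify_reconstruction(raw, umd5, usize, tree_md5, tree_size, 1)[0]
broken = verify_reconstruction(raw + b"x", umd5, usize, tree_md5, tree_size)[0]
print(same, other, broken)
```

Only the "alternate" outcome would lead to a new entry in the md5 database
described next.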

Compressed MD5sums:
'''''''''''''''''''

There will be instances where a file is reconstructed perfectly, recompressed,
and the recompressed md5sum differs from what is stored in the tree. The
problem is that the md5sum of a compressed file is inherently tied to the
compressor version/options used to compress the original source.

=====================
The Problem in Detail
=====================

A good example of this problem is the bzip2 version used for compression.
Between bzip2 0.9x and bzip2 1.x, there was a subtle change in the compressor
resulting in slightly better compression; the end result is a different
file, i.e. a different md5sum. Assuming compressor versions are the same,
there is also the issue of what compression level the target source was
originally compressed at: was it -9, -8 or -7? That's just a sampling of the
original settings that must be accounted for, and it is limited to
gzip/bzip2; other compressors will add to the number of variables to be
accounted for to produce an exact recreation of the compressed md5sum.
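
The compression level effect is easy to reproduce with Python's standard
``bz2`` module. The sketch below uses made-up sample data and compares levels
9 and 1 only to make the difference obvious; it says nothing about the
differences between bzip2 releases.

```python
import bz2
import hashlib

data = b"identical uncompressed source\n" * 2000

level9 = bz2.compress(data, 9)
level1 = bz2.compress(data, 1)

# Both archives decompress to exactly the same bytes...
same_source = bz2.decompress(level9) == bz2.decompress(level1) == data

# ...yet the compressed files differ, so their md5sums differ too.
md5_9 = hashlib.md5(level9).hexdigest()
md5_1 = hashlib.md5(level1).hexdigest()

print("same uncompressed source:", same_source)
print("same compressed md5:", md5_9 == md5_1)
```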

Tracking the compressor version and options originally used isn't really a
valid option. Even if all options were accounted for, clients would still be
required to have multiple versions of the same compressor installed just for
the sake of recreating a compressed md5sum *even though* the uncompressed
source's md5 has already been verified.

=====================
The Proposed Solution
=====================

The creation of a client-side flatfile/db of valid alternate md5/size pairs
would enable portage to handle perfectly reconstructed files that have a
different md5sum due to compression differences. The proposed format is:

::

  MD5 md5sum orig-file size MD5 md5sum [ optional new-name ] size

Example:

::

  MD5 984146931906a7d53300b29f58f6a899 OOo_1.0.3_source.tar.bz2 165475319 MD5 0733dd85ed44d88d1eabed704d579721 165444187

An alternate md5/size pair for a file would be added **only** when the
uncompressed source's md5/size has been verified, yet the md5 differs upon
recompression. Cleansing older md5/size pairs from this db would require a
utility. The author suggests adding a distfiles-cleaning utility to portage,
with the ability to also cleanse old md5/size pairs when the file the pair
was created for no longer exists in distfiles.
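
A minimal sketch of how such a db could be read, consulted, and pruned,
assuming whitespace-separated fields and file names without spaces. All names
here (``AltEntry``, ``parse_alt_entry``, ``is_acceptable``, ``prune``) are
hypothetical and exist only for illustration.

```python
from typing import NamedTuple, Optional


class AltEntry(NamedTuple):
    orig_md5: str
    orig_file: str
    orig_size: int
    alt_md5: str
    new_name: Optional[str]   # None when the optional field is absent
    alt_size: int


def parse_alt_entry(line: str) -> AltEntry:
    """Parse one db line; the new-name field may be present or omitted."""
    f = line.split()
    if len(f) not in (7, 8) or f[0] != "MD5" or f[4] != "MD5":
        raise ValueError(f"malformed alternate md5 entry: {line!r}")
    new_name = f[6] if len(f) == 8 else None
    return AltEntry(f[1], f[2], int(f[3]), f[5], new_name, int(f[-1]))


def is_acceptable(entries, filename, md5, size, tree_md5, tree_size):
    """Accept the tree's md5/size, or a recorded alternate pair for that file."""
    if (md5, size) == (tree_md5, tree_size):
        return True
    return any(
        (e.new_name or e.orig_file) == filename
        and (e.orig_md5, e.orig_size) == (tree_md5, tree_size)
        and (e.alt_md5, e.alt_size) == (md5, size)
        for e in entries
    )


def prune(entries, present):
    """Drop pairs whose file no longer exists in distfiles."""
    return [e for e in entries if (e.new_name or e.orig_file) in present]


entry = parse_alt_entry(
    "MD5 984146931906a7d53300b29f58f6a899 OOo_1.0.3_source.tar.bz2 165475319 "
    "MD5 0733dd85ed44d88d1eabed704d579721 165444187"
)
ok = is_acceptable(
    [entry], "OOo_1.0.3_source.tar.bz2",
    "0733dd85ed44d88d1eabed704d579721", 165444187,
    "984146931906a7d53300b29f58f6a899", 165475319,
)
print(ok, prune([entry], set()))
```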

Where to store the database is debatable; /etc/portage or /var/cache/edb are
definite options.

The reasoning for allowing an optional new-name is that it provides needed
functionality should anyone attempt to extend portage to allow clients to
change the compression used for a source (e.g., recompress all gzip files as
bzip2). Granted, no such code or attempt has been made, but nothing is lost
by leaving the option open should the request/attempt be made.

A potential gotcha of adding this support is that in environments where the
distfiles directory is shared out to multiple systems, this db must be shared
also.

Distfile Mirror Additions
-------------------------

One issue of contention is where these files will actually be stored. As of
the writing of this GLEP, a full distfiles mirror is roughly 40 GB; a rough
estimate by the author places the space requirements for patches for each
version at a total of around 4 GB. Note this isn't even remotely a hard
figure yet, and a better figure is currently being worked out.

Regardless of the exact space figure, finding a place to store the patches
will be problematic. Expansion of the required mirror space (essentially just
swallowing the patches' storage requirement) is unlikely, since it was one of
the main arguments against the now defunct glep9 attempt [3]_. A couple of
ideas that have been put forth to handle the additional space requirements
are as follows:

1) Identification of mirrors willing to handle the extra space requirements,
   essentially creating an additional patch mirror tier.

2) Mirroring only a patch for certain package versions, rather than full
   source. Using kdelibs-3.1.5 as an example, only the patch would be mirrored
   (rather than the full 10.54 MB source). The downside to this approach is
   that a user who is downloading kdelibs for the first time would either
   need to pull it from the original SRC_URI (placing the burden onto the
   upstream mirror), or pull the 3.1.4 version and the patch, pulling 63 kB
   more than if they had just pulled the full version. The kdelibs
   3.1.4/3.1.5 example is something of an optimal case; not all versions will
   have such minuscule patches.

3) A variation on the idea above: mirroring only the patch for the oldest
   version(s) of a package. E.g., kdelibs currently has versions 3.05, 3.1.5,
   3.2.0, and 3.2.1; the mirrors would only carry a patch for 3.05, not full
   source (think RESTRICT="fetch"). One plus is that patches to downgrade in
   version are smaller than the patches to upgrade in version; there are
   exceptions to this, but they're hard to find. The major downsides of this
   approach are A) a user would have to sync up to get the patchlists for
   that version, and B) a set of patches to go backwards in version would
   have to be created (see `Binary patches vs GNUDiff patches`_).

Of the options listed above, the first is the easiest, although the second
could be made to work. Feedback and any possible alternatives would be
greatly appreciated.

Patch Creation
--------------

Maintenance of patch lists and the actual patch creation ought to be managed
by a high level script. Essentially a dev says "I want a patch between this
version and that version: make it so", and the script churns away
creating/updating the patch list and generating the patch locally. The
utility next uploads the new patch to /space/distfiles-local on dev.gentoo.org
(unless it's not a locally generated patch), and repoman is used to commit
the updated patch list.

What would be preferable (although possibly wishful thinking) is if hardware
could be co-opted for automatic patch generation, rather than forcing it upon
the devs, something akin to how files are pulled onto the mirror automatically
for new ebuilds.

The initial bulk of patches will be generated by the author, to ease the
transition and offer patches for people to test out.

Backwards Compatibility
=======================

As noted in `The Proposed Solution`_, a system using patching and sharing out
its distfiles must share out its alternate md5 db. Any system that uses the
distfiles share must support the alternate md5 db also. If this is considered
enough of an issue, it is conceivable to place reconstructed sources with an
alternate md5 into a subdirectory of distdir, since portage only looks within
distdir itself and does not descend into subdirectories.

Also note that `Distfile Mirror Additions`_ may add additional backwards
compatibility issues, depending on what solution is accepted.

Reference Implementation
========================

TODO

References
==========

.. [1] http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.{patch,diff}.bz2.
.. [2] kdelibs-3.1.4-3.1.5.patch.bz2, switching format patch, created via diffball-0.4_pre4 (diffball is available at http://sourceforge.net/projects/diffball)
   Bzip2 -9 compressed, the patch is 75,687 bytes, uncompressed it is 337,649 bytes. The patch is available at http://dev.gentoo.org/~ferringb/kdelibs-3.1.4-3.1.5.patch.bz2 for those curious.
.. [3] Glep9, 'Gentoo Package Update System'
   (http://glep.gentoo.org/glep-0009.html)

Copyright
=========

This document has been placed in the public domain.