GLEP: 25
Title: Distfile Patching Support
Version: $Revision: 1.2 $
Last-Modified: $Date: 2004/11/11 21:34:36 $
Author: Brian Harring <ferringb@gentoo.org>
Status: deferred
Type: Standards Track
Content-Type: text/x-rst
Created: 6-Mar-2004
Post-History: 4-Apr-2004, 11-Nov-2004

Abstract
========

The intention of this GLEP is to propose the creation of patching support for
portage, and to iron out the implementation details.

Status
======

Timed out


Motivation
==========

Reduce the bandwidth load placed on our mirrors by decreasing the number of
bytes transferred when upgrading between versions. A side benefit is a
significant decrease in the download requirements for users lacking broadband.

Binary patches vs GNUDiff patches
=================================

Most people are familiar with diff patches (unified diff, for example); this
glep specifically proposes the use of an actual binary differencer. The
reason is that diff patches are line based: change a single character in a
line, and the whole line must be included in the patch. Binary differencers
work at the byte level, so they encode just that byte. In that respect binary
patches are often much more efficient than diff patches.

Further, a unified patch can be reversed because the diff includes **both**
the original line and the modified line. The author isn't aware of any binary
differencer that is able to create patches that can be reversed. They are
unidirectional: a generated patch can be used either to upgrade or to
downgrade the version, not both. The plus side of this limitation is a
significantly decreased patch size.

The choice of binary patches over diff patches comes down to the fact that
they're smaller. For example, a kdelibs binary patch for 3.1.4->3.1.5 is
75kb, while the equivalent diff patch is 123kb and is unable to produce a
file with the correct md5 [1]_.

Currently, this glep proposes only the usage of binary patches. That's not
to say it couldn't be extended (with a fair amount of work) to support
standard diffs.

Rationale
=========

The difference between source releases typically isn't very large, especially
for minor releases. As an example, kdelibs-3.1.4.tar.bz2 is 10.53 MB, and
kdelibs-3.1.5.tar.bz2 is 10.54 MB. A bzip2'ed patch between those versions is
75.6 kb [2]_, less than 1% of the size of 3.1.5's tbz2.

Specification
=============

Quite a few parts of Gentoo are affected: mirroring, the portage tree, and
portage itself.

Additions to the tree
---------------------

For adding patch info into the tree, this glep proposes a global patch list
(stored in profiles as patches.global), and individual patch lists stored in
the relevant package directories (named patches). Using the kernel packages as
an example, a global list of patches enables us to create a patch once, add an
entry, and have all kernel packages benefit from that single entry. Both
patches.global and the individual package patch files share the same format:

::

    MD5 md5-value patch-url size MD5 md5-value ref-file size UMD5 md5-value new-file size

This should look familiar to those who know the digest file layout:
essentially, chksum type, value, filename, size. The UMD5 chksum type is just
the md5/size of the uncompressed file, so if the UMD5 were for a bzip2
compressed file, it would be the md5 value/size of the uncompressed file.
And an example:

::

    MD5 ccd5411b3558326cbce0306fcae32e26 http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.patch.bz2 75687 MD5 82c265de78d53c7060a09c5cb1a78942 kdelibs-3.1.4.tar.bz2 10537433 UMD5 0b1908a51e739c07ff5a88e189d2f7a9 kdelibs-3.1.5.tar.bz2 48056320

In the above example, the md5sum of
http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.patch.bz2 is
calculated and compared to the stored value, and then the file size is checked.
The one difference is the UMD5 checksum type: the md5 value and the size are
specific to the *uncompressed* file. For cases where the patch will reside on
one of our mirrors, the patch filename alone would be sufficient in place of
the full URL.

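As an illustration only, an entry in this format could be split into its
three checksum groups along the following lines. This is a minimal Python
sketch, not portage code: the names ``Checksum``, ``PatchEntry`` and
``parse_patch_entry`` are hypothetical, and it assumes every entry has exactly
twelve whitespace-separated fields in the order shown above.

::

    from typing import NamedTuple


    class Checksum(NamedTuple):
        kind: str   # "MD5" or "UMD5"
        value: str  # hex digest
        name: str   # file name or URL
        size: int   # size in bytes


    class PatchEntry(NamedTuple):
        patch: Checksum   # the patch itself
        ref: Checksum     # the file the patch is applied to
        target: Checksum  # the file the patch produces (uncompressed values)


    def parse_patch_entry(line: str) -> PatchEntry:
        """Split one patch list line into its three checksum groups."""
        fields = line.split()
        if len(fields) != 12:
            raise ValueError(f"expected 12 fields, got {len(fields)}")
        groups = [fields[i:i + 4] for i in range(0, 12, 4)]
        sums = [Checksum(k, v, n, int(s)) for k, v, n, s in groups]
        if [c.kind for c in sums] != ["MD5", "MD5", "UMD5"]:
            raise ValueError("expected checksum types MD5, MD5, UMD5")
        return PatchEntry(*sums)

Parsing the kdelibs example this way yields a patch size of 75687 bytes, a
reference file of 10537433 bytes, and a UMD5 target size of 48056320 bytes.
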
Finally, note that this is a unidirectional patch. Using the above example,
kdelibs-3.1.4-3.1.5 can **only** be used to upgrade from 3.1.4 to 3.1.5, not
in reverse (originally explained in `Binary patches vs GNUDiff patches`_).

Portage Implementation
----------------------

This glep proposes that patching support should be (at this stage) optional;
specifically, enabled via FEATURES="patching".

Fetching
''''''''

When patching is enabled, the global patch list and the package's patch list
are read. From there, portage determines which files could be used as a base
for patching to the desired file, and whether patching is actually worth it
(it isn't when the target file is smaller than the combined size of the
patches needed). Any patches to be used are fetched and md5 verified.

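The "is it worth patching" check amounts to a size comparison. A minimal
Python sketch follows; the function name ``worth_patching`` is hypothetical,
and it assumes the compressed size of the target and the sizes of all patches
in the chain are already known from the tree.

::

    from typing import Iterable


    def worth_patching(target_size: int, patch_sizes: Iterable[int]) -> bool:
        """Return True when fetching the patches is a smaller download
        than fetching the target file itself."""
        return sum(patch_sizes) < target_size

With the kdelibs figures, a single 75687 byte patch against a target of
roughly 10.54 MB is clearly worth fetching; a chain of patches that together
outweigh a small target is not.
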
Reconstruction
''''''''''''''

Upon fetching and md5 verification of the patch(es), the desired file is
reconstructed. Assuming reconstruction didn't return any errors, the target
file has its uncompressed md5sum calculated and verified, then is recompressed
and the compressed md5sum calculated. At this point, if the compressed md5
matches the md5 stored in the tree, then portage transfers the file into
distfiles, and continues on its merry way.

If the compressed md5 is different from the tree's value, then the (proposed)
md5 database is updated with the new compressed md5. Details of this database
(and the issue it addresses) follow.

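The verification order described above can be sketched in Python as follows.
This is an assumption-laden illustration rather than portage's implementation:
the function names are hypothetical, the whole file is held in memory, and
recompression is assumed to be bzip2 at level 9.

::

    import bz2
    import hashlib


    def md5_hex(data: bytes) -> str:
        return hashlib.md5(data).hexdigest()


    def verify_reconstructed(data, umd5, usize, tree_md5, tree_size):
        """Classify a reconstructed, uncompressed distfile.

        Returns (status, compressed_bytes), where status is one of:
          "reject"    - the uncompressed md5/size does not match the UMD5 entry
          "match"     - the recompressed file matches the md5/size in the tree
          "alternate" - the source is verified, but the compressed md5 differs,
                        so an alternate md5/size pair would be recorded
        """
        if len(data) != usize or md5_hex(data) != umd5:
            return "reject", None
        packed = bz2.compress(data, 9)
        if len(packed) == tree_size and md5_hex(packed) == tree_md5:
            return "match", packed
        return "alternate", packed

Only the "alternate" outcome touches the md5 database; a "reject" means the
reconstruction itself failed and nothing is moved into distfiles.
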
Compressed MD5sums:
'''''''''''''''''''

There will be instances where a file is reconstructed perfectly, recompressed,
and the recompressed md5sum differs from what is stored in the tree. The
problem is that the md5sum of a compressed file is inherently tied to the
compressor version/options used to compress the original source.

=====================
The Problem in Detail
=====================

A good example of this problem is related to the bzip2 versions used for
compression. Between bzip2 0.9x and bzip2 1.x, there was a subtle change in
the compressor resulting in a slightly better compression result; the end
result is a different file, i.e. a different md5sum. Assuming compressor
versions are the same, there is also the issue of what compression level the
target source was originally compressed at: was it compressed with -9, -8 or
-7? That's just a sampling of the various original settings that must be
accounted for, and that's limited to gzip/bzip2; other compressors will add to
the number of variables to be accounted for to produce an exact recreation of
the compressed md5sum.

Tracking the compressor version and options originally used isn't really a
valid option. Even assuming all options were accounted for, clients would
still be required to have multiple versions of the same compressor installed
just for the sake of recreating a compressed md5sum *even though* the
uncompressed source's md5 has already been verified.

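The compression level half of the problem is easy to reproduce. The Python
sketch below uses made-up sample data (not a real distfile) and compresses it
with bzip2 at levels 9 and 7: both results decompress to the identical source,
yet their md5sums differ, since the level is recorded in the stream itself.

::

    import bz2
    import hashlib

    # Stand-in for an uncompressed source tarball (made-up data).
    source = b"int main(void) { return 0; }\n" * 4096

    level9 = bz2.compress(source, 9)
    level7 = bz2.compress(source, 7)

    # Both streams carry exactly the same uncompressed payload...
    same_payload = bz2.decompress(level9) == bz2.decompress(level7) == source

    # ...but the compressed files, and therefore their md5sums, differ.
    md5_level9 = hashlib.md5(level9).hexdigest()
    md5_level7 = hashlib.md5(level7).hexdigest()
    same_md5 = md5_level9 == md5_level7

Here ``same_payload`` is true while ``same_md5`` is false, which is exactly
the situation a client is in after a perfect reconstruction.
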
=====================
The Proposed Solution
=====================

The creation of a clientside flatfile/db of valid alternate md5/size pairs
would enable portage to handle perfectly reconstructed files that have a
different md5sum due to compression differences. The proposed format is thus:

::

    MD5 md5sum orig-file size MD5 md5sum [ optional new-name ] size

Example:

::

    MD5 984146931906a7d53300b29f58f6a899 OOo_1.0.3_source.tar.bz2 165475319 MD5 0733dd85ed44d88d1eabed704d579721 165444187

An alternate md5/size pair for a file would be added **only** when the
uncompressed source's md5/size has been verified, yet upon recompression the
md5 differs. For cleansing older md5/size pairs from this db, a utility
would be required. The author suggests the addition of a distfiles-cleaning
utility to portage, with the ability to also cleanse old md5/size pairs when
the file the pair was created for no longer exists in distfiles.

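One way a client might read such an entry and use it during verification is
sketched below in Python. The names ``AltMd5``, ``parse_alt_md5`` and
``is_acceptable`` are hypothetical, and the sketch assumes an entry has seven
whitespace-separated fields, or eight when the optional new-name is present.

::

    from typing import NamedTuple, Optional


    class AltMd5(NamedTuple):
        orig_md5: str
        orig_file: str
        orig_size: int
        alt_md5: str
        new_name: Optional[str]  # None when the file keeps its name
        alt_size: int


    def parse_alt_md5(line: str) -> AltMd5:
        """Parse one line of the alternate md5 db."""
        f = line.split()
        if len(f) not in (7, 8) or f[0] != "MD5" or f[4] != "MD5":
            raise ValueError("malformed alternate md5 entry")
        new_name = f[6] if len(f) == 8 else None
        return AltMd5(f[1], f[2], int(f[3]), f[5], new_name, int(f[-1]))


    def is_acceptable(entry: AltMd5, md5: str, size: int) -> bool:
        """True if md5/size match the tree's pair or the alternate pair."""
        return (md5, size) in {
            (entry.orig_md5, entry.orig_size),
            (entry.alt_md5, entry.alt_size),
        }

For the OOo example, both the tree's pair (165475319 bytes) and the locally
recompressed pair (165444187 bytes) would then be accepted for
OOo_1.0.3_source.tar.bz2.
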
Where to store the database is debatable; /etc/portage or /var/cache/edb are
definite options.

The reasoning for allowing an optional new-name is that it provides needed
functionality should anyone attempt to extend portage to allow clients to
change the compression used for a source (eg, recompress all gzip files as
bzip2). Granted, no such code exists and no such attempt has been made, but
nothing is lost by leaving the option open should the request/attempt be made.

A potential gotcha of adding this support is that in environments where the
distfiles directory is shared out to multiple systems, this db must be shared
also.


Distfile Mirror Additions
-------------------------

One issue of contention is where these files will actually be stored. As of
the writing of this glep, a full distfiles mirror is roughly 40 gb; a rough
estimate by the author places the space requirements for patches for each
version at a total of around 4gb. Note this isn't even remotely a hard figure
yet, and a better figure is currently being worked out.

Regardless of the exact space figure, finding a place to store the patches
will be problematic. Expansion of the required mirror space (essentially just
swallowing the patches' storage requirement) is unlikely, since it was one of
the main arguments against the now defunct glep9 attempt [3]_. A couple of
ideas that have been put forth to handle the additional space requirements are
as follows:

1) Identification of mirrors willing to handle the extra space requirements;
   essentially create an additional patch mirror tier.

2) Mirroring only a patch for certain package versions, rather than full
   source. Using kdelibs-3.1.5 as an example, only the patch would be mirrored
   (rather than the full 10.54 MB source). The downside to this approach is
   that a user who is downloading kdelibs for the first time would either need
   to pull it from the original SRC_URI (placing the burden onto the upstream
   mirror), or pull the 3.1.4 version and the patch, pulling 63k more than if
   they had just pulled the full version. The kdelibs 3.1.4/3.1.5 example is
   something of an optimal case; not all versions will have such minuscule
   patches.

3) A variation on the idea above, essentially mirroring only the patch for
   the oldest version(s) of a package; eg, kdelibs currently has versions
   3.05, 3.1.5, 3.2.0, and 3.2.1, and the mirrors would only carry a patch
   for 3.05, not full source (think RESTRICT="fetch"). One plus to this is
   that patches to downgrade in version are smaller than the patches to
   upgrade in version; there are exceptions to this, but they're hard to
   find. The major downsides to this approach are that A) a user would have
   to sync up to get the patchlists for that version, and B) a set of patches
   to go backwards in version would have to be created (see
   `Binary patches vs GNUDiff patches`_).

Of the options listed above, the first is the easiest, although the second
could be made to work. Feedback and any possible alternatives would be
greatly appreciated.

Patch Creation
--------------

Maintenance of patch lists and the actual patch creation ought to be managed
by a high level script. Essentially a dev says "I want a patch between this
version, and that version: make it so", and the script churns away
creating/updating the patch list and generating the patch locally. The
utility next uploads the new patch to /space/distfiles-local on dev.gentoo.org
(except when it's not a locally generated patch), and repoman is used to
commit the updated patch list.

What would be preferable (although possibly wishful thinking) is if hardware
could be co-opted for automatic patch generation, rather than forcing it upon
the devs; something akin to how files are pulled onto the mirror automatically
for new ebuilds.

The initial bulk of patches will be generated by the author, to ease
the transition and offer patches for people to test out.

Backwards Compatibility
=======================

As noted in `The Proposed Solution`_, a system using patching and sharing out
its distfiles must share out its alternate md5 db. Any system that uses the
distfiles share must support the alternate md5 db also. If this is considered
enough of an issue, it is conceivable to place reconstructed sources with an
alternate md5 into a subdirectory of distdir, since portage only looks within
distdir and is unwilling to descend into subdirectories.

Also note that `Distfile Mirror Additions`_ may add additional backwards
compatibility issues, depending on what solution is accepted.

Reference Implementation
========================

TODO

References
==========

.. [1] http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.{patch,diff}.bz2.
.. [2] kdelibs-3.1.4-3.1.5.patch.bz2, switching format patch, created via diffball-0.4_pre4 (diffball is available at http://sourceforge.net/projects/diffball)
   Bzip2 -9 compressed, the patch is 75,687 bytes, uncompressed it is 337,649 bytes. The patch is available at http://dev.gentoo.org/~ferringb/kdelibs-3.1.4-3.1.5.patch.bz2 for those curious.
.. [3] Glep9, 'Gentoo Package Update System'
   (http://glep.gentoo.org/glep-0009.html)

Copyright
=========

This document has been placed in the public domain.