GLEP: 25
Title: Distfile Patching Support
Version: $Revision: 1.0 $
Last-Modified: $Date: 2004/04/01 01:07:40 $
Author: Brian Harring <ferringb@gentoo.org>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 6-Mar-2004
Post-History: 4-Apr-2004

Abstract
========

This GLEP proposes adding distfile patching support to portage, and works
through the implementation details.

Motivation
==========

Reduce the bandwidth load placed on our mirrors by decreasing the number of
bytes transferred when upgrading between versions. A side benefit is a
significantly smaller download for users lacking broadband.

Binary patches vs GNUDiff patches
=================================

Most people are familiar with diff patches (unified diff, for example); this
GLEP specifically proposes the use of an actual binary differencer. The
reason is that diff patches are line based: change a single character in a
line, and the whole line must be included in the patch. A binary differencer
works at the byte level and encodes just the changed byte. In that respect
binary patches are often much more efficient than diff patches.

Further, a unified patch can be reversed because the diff includes **both**
the original line and the modified line. The author isn't aware of any binary
differencer that is able to create patches that can be reversed; they are
unidirectional, so a generated patch can be used either to upgrade or to
downgrade the version, not both. The plus side of this limitation is a
significantly decreased patch size.

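As a rough illustration of the two points above, the sketch below changes one
character in a line and compares a unified diff (via Python's ``difflib``)
with a toy byte-level delta. The delta format here is invented for the
example and is not the format of any real binary differencer.

```python
import difflib

old = "int main(void) { return compute(alpha, beta, gamma, 0); }\n"
new = "int main(void) { return compute(alpha, beta, gamma, 1); }\n"

# Line based: the unified diff carries the whole old line and the whole new
# line, which is also what makes it reversible.
line_patch = "".join(difflib.unified_diff([old], [new], "a", "b"))

# Toy byte-level delta: (offset, replacement byte) per differing position.
# This only works because both inputs have the same length, and it keeps
# nothing of the old byte, so it cannot be applied in reverse.
byte_patch = [
    (offset, new_byte)
    for offset, (old_byte, new_byte) in enumerate(zip(old.encode(), new.encode()))
    if old_byte != new_byte
]

print(len(line_patch), "characters in the unified diff")
print(len(byte_patch), "entry in the toy byte delta:", byte_patch)
```

The unified diff repeats the full line twice, while the toy delta records a
single offset and byte.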
The choice of binary patches over diff patches pretty much comes down to the
fact that they're smaller. For example, a kdelibs binary patch for
3.1.4->3.1.5 is 75kb, while the equivalent diff patch is 123kb and is unable
to produce a file with the correct md5 [1]_.

Currently, this GLEP proposes only the use of binary patches. That's not to
say it couldn't be extended (with a fair amount of work) to support standard
diffs.

Rationale
=========

The difference between source releases typically isn't very large, especially
for minor releases. As an example, kdelibs-3.1.4.tar.bz2 is 10.53 MB, and
kdelibs-3.1.5.tar.bz2 is 10.54 MB. A bzip2'ed patch between those versions is
75.6 kb [2]_, less than 1% of the size of 3.1.5's tbz2.

Specification
=============

Quite a few parts of Gentoo are affected: mirroring, the portage tree, and
portage itself.

Additions to the tree
---------------------

To add patch info to the tree, this GLEP proposes a global patch list
(stored in profiles as patches.global), and individual patch lists stored in
the relevant package directories (named patches). Using the kernel packages
as an example, a global list of patches lets us create a patch once, add an
entry, and have all kernel packages benefit from that single entry. Both
patches.global and the individual package patch files share the same format:

::

    MD5 md5-value patch-url size MD5 md5-value ref-file size UMD5 md5-value new-file size

Anyone familiar with the digest file layout should recognize this: each
record is a checksum type, value, filename, and size. The UMD5 checksum type
is the md5/size of the uncompressed file, so if the UMD5 record is for a
bzip2-compressed file, it holds the md5 value/size of that file after
decompression. An example:

::

    MD5 ccd5411b3558326cbce0306fcae32e26 http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.patch.bz2 75687 MD5 82c265de78d53c7060a09c5cb1a78942 kdelibs-3.1.4.tar.bz2 10537433 UMD5 0b1908a51e739c07ff5a88e189d2f7a9 kdelibs-3.1.5.tar.bz2 48056320

In the above example, the md5sum of
http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.patch.bz2 is
calculated and compared to the stored value, and then the file size is checked.
The one difference is the UMD5 checksum type: its md5 value and size are
specific to the *uncompressed* file. For cases where the patch resides on one
of our mirrors, the patch filename alone (rather than a full URL) would be
sufficient.

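A minimal sketch of reading one patch list entry follows. It assumes the
three records always appear in the order shown above (patch, ref-file,
new-file) and that fields are whitespace separated; ``Record``,
``PatchEntry`` and ``parse_patch_line`` are names made up for this example,
not existing portage code.

```python
from typing import NamedTuple


class Record(NamedTuple):
    chksum_type: str  # "MD5" or "UMD5"
    value: str        # hex digest
    name: str         # URL or bare filename
    size: int         # size in bytes


class PatchEntry(NamedTuple):
    patch: Record  # the patch itself
    ref: Record    # the file the patch is applied to
    new: Record    # the reconstructed file (uncompressed md5/size)


def parse_patch_line(line: str) -> PatchEntry:
    """Split one entry into its three (type, value, name, size) records."""
    fields = line.split()
    if len(fields) != 12:
        raise ValueError(f"expected 12 fields, got {len(fields)}")
    records = [
        Record(fields[i], fields[i + 1], fields[i + 2], int(fields[i + 3]))
        for i in range(0, 12, 4)
    ]
    if [r.chksum_type for r in records] != ["MD5", "MD5", "UMD5"]:
        raise ValueError("expected MD5, MD5, UMD5 records in that order")
    return PatchEntry(*records)


entry = parse_patch_line(
    "MD5 ccd5411b3558326cbce0306fcae32e26 "
    "http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.patch.bz2 75687 "
    "MD5 82c265de78d53c7060a09c5cb1a78942 kdelibs-3.1.4.tar.bz2 10537433 "
    "UMD5 0b1908a51e739c07ff5a88e189d2f7a9 kdelibs-3.1.5.tar.bz2 48056320"
)
print(entry.ref.name, "->", entry.new.name, "via", entry.patch.size, "byte patch")
```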
Finally, note that this is a unidirectional patch. Using the above example,
kdelibs-3.1.4-3.1.5 can **only** be used to upgrade from 3.1.4 to 3.1.5, not
the reverse (as explained in `Binary patches vs GNUDiff patches`_).

Portage Implementation
----------------------

This GLEP proposes that patching support be optional (at this stage),
enabled via FEATURES="patching".

Fetching
''''''''

When patching is enabled, the global patch list and the package's patch list
are read. From there, portage determines which files could be used as a base
for patching to the desired file, and whether patching is actually worth it
(it isn't when the target file is smaller than the sum of the patches
needed). Any patches to be used are fetched and md5 verified.

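One way to picture this decision is as a cheapest-path search over the patch
lists, where each patch is an edge from its ref-file to its new-file and
costs its size in bytes. The sketch below is only an assumption about how
that could be done; ``plan_patches`` and its tuple layout are hypothetical
and say nothing about how portage would actually implement it.

```python
import heapq


def plan_patches(target, target_size, available, patches):
    """Return the cheapest list of patch names leading to ``target``.

    ``available`` is the set of files already in distfiles and ``patches`` is
    a list of (ref_file, new_file, patch_name, patch_size) tuples. Returns []
    if the target is already present, and None if no chain exists or the
    patches add up to at least the size of the target itself.
    """
    best = {name: 0 for name in available}
    heap = [(0, name, []) for name in available]
    heapq.heapify(heap)
    while heap:
        cost, name, chain = heapq.heappop(heap)
        if cost > best.get(name, float("inf")):
            continue  # a cheaper route to this file was already found
        if name == target:
            return chain if cost < target_size else None
        for ref, new, patch, size in patches:
            if ref == name and cost + size < best.get(new, float("inf")):
                best[new] = cost + size
                heapq.heappush(heap, (cost + size, new, chain + [patch]))
    return None


patches = [
    ("kdelibs-3.1.4.tar.bz2", "kdelibs-3.1.5.tar.bz2",
     "kdelibs-3.1.4-3.1.5.patch.bz2", 75687),
]
# 10_540_000 stands in for "10.54 MB"; it is not an exact byte count.
plan = plan_patches("kdelibs-3.1.5.tar.bz2", 10_540_000,
                    {"kdelibs-3.1.4.tar.bz2"}, patches)
print(plan)
```

A ``None`` result would mean falling back to a normal full download.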
Reconstruction
''''''''''''''

Once the patch(es) have been fetched and md5 verified, the desired file is
reconstructed. Assuming reconstruction didn't return any errors, the target
file has its uncompressed md5sum calculated and verified, then is recompressed
and the compressed md5sum calculated. At this point, if the compressed md5
matches the md5 stored in the tree, portage transfers the file into
distfiles and continues on its merry way.

If the compressed md5 differs from the tree's value, the (proposed) md5
database is updated with the new compressed md5. Details of this database
(and the issue it addresses) follow.

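The verification order described above can be sketched as follows. This
assumes a bzip2-compressed target recompressed at level 9 and works on
in-memory bytes for brevity; ``verify_reconstruction`` is a hypothetical
helper, not portage code.

```python
import bz2
import hashlib


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def verify_reconstruction(raw, umd5, usize, tree_md5, tree_size):
    """Check a reconstructed (uncompressed) file, then recompress it.

    Returns (compressed_bytes, alternate_pair). ``alternate_pair`` is None
    when the recompressed file matches the tree's md5/size, otherwise it is
    the (md5, size) pair that would be recorded in the alternate md5 db.
    """
    # Step 1: the uncompressed md5/size (UMD5) must match, or we stop here.
    if len(raw) != usize or md5_hex(raw) != umd5:
        raise ValueError("reconstructed file failed UMD5/size verification")
    # Step 2: recompress and compare against the md5/size stored in the tree.
    packed = bz2.compress(raw, 9)
    pair = (md5_hex(packed), len(packed))
    if pair == (tree_md5, tree_size):
        return packed, None
    # Step 3: contents are verified, only the compressed form differs.
    return packed, pair


raw = b"reconstructed source contents\n" * 1000
upstream = bz2.compress(raw, 1)  # pretend upstream used a different level
packed, alternate = verify_reconstruction(
    raw, md5_hex(raw), len(raw), md5_hex(upstream), len(upstream)
)
print("alternate md5/size pair to record:", alternate)
```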
Compressed MD5sums
''''''''''''''''''

There will be instances where a file is reconstructed perfectly and
recompressed, yet the recompressed md5sum differs from what is stored in the
tree. The problem is that the md5sum of a compressed file is inherently tied
to the compressor version and options used to compress the original source.

=====================
The Problem in Detail
=====================

A good example of this problem involves the bzip2 version used for
compression. Between bzip2 0.9x and bzip2 1.x, there was a subtle change in
the compressor that gives slightly better compression; the end result is a
different file, i.e. a different md5sum. Even assuming compressor versions
are the same, there is also the issue of what compression level the target
source was originally compressed at: was it -9, -8 or -7? That's just a
sampling of the original settings that must be accounted for, and it's
limited to gzip/bzip2; other compressors add to the number of variables that
must be matched to produce an exact recreation of the compressed md5sum.

Tracking the compressor version and options originally used isn't really a
valid option. Even if all options were accounted for, clients would still be
required to have multiple versions of the same compressor installed just for
the sake of recreating a compressed md5sum, *even though* the uncompressed
source's md5 has already been verified.

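The compression-level half of the problem is easy to reproduce with Python's
``bz2`` module. The snippet below is only a demonstration on made-up data:
the same bytes compressed at levels 7, 8 and 9 give three different
compressed md5sums but one uncompressed md5sum.

```python
import bz2
import hashlib

source = b"int main(void) { return 0; }\n" * 2000

# Same input, three compression levels (the -7, -8 and -9 cases).
packed = {level: bz2.compress(source, level) for level in (7, 8, 9)}

# The bzip2 stream header records the block size, so the compressed bytes
# (and therefore their md5sums) differ from level to level.
compressed_md5s = {
    level: hashlib.md5(blob).hexdigest() for level, blob in packed.items()
}

# Decompressing any of them gives back the identical source.
uncompressed_md5s = {
    hashlib.md5(bz2.decompress(blob)).hexdigest() for blob in packed.values()
}

for level, digest in sorted(compressed_md5s.items()):
    print(f"-{level}: {digest}")
print("distinct uncompressed md5sums:", len(uncompressed_md5s))
```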
=====================
The Proposed Solution
=====================

A client-side flatfile/db of valid alternate md5/size pairs would enable
portage to handle perfectly reconstructed files that have a different md5sum
due to compression differences. The proposed format is:

::

    MD5 md5sum orig-file size MD5 md5sum [ optional new-name ] size

Example:

::

    MD5 984146931906a7d53300b29f58f6a899 OOo_1.0.3_source.tar.bz2 165475319 MD5 0733dd85ed44d88d1eabed704d579721 165444187

An alternate md5/size pair for a file would be added **only** when the
uncompressed source's md5/size has been verified, yet the md5 differs after
recompression. Cleansing older md5/size pairs from this db would require a
utility; the author suggests adding a distfiles-cleaning utility to portage
that can also cleanse old md5/size pairs once the file a pair was created for
no longer exists in distfiles.

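A sketch of how such a db might be read, consulted, and cleaned is given
below. All names are hypothetical, and two things are assumed rather than
specified by this GLEP: that fields are whitespace separated, and that a
file with a new-name is stored in distfiles under that new name.

```python
import os
from typing import NamedTuple, Optional


class AltEntry(NamedTuple):
    orig_md5: str
    orig_file: str
    orig_size: int
    alt_md5: str
    new_name: Optional[str]  # only present in the 8-field form
    alt_size: int


def parse_alt_line(line: str) -> AltEntry:
    f = line.split()
    if len(f) not in (7, 8) or f[0] != "MD5" or f[4] != "MD5":
        raise ValueError("malformed alternate md5 entry")
    new_name = f[6] if len(f) == 8 else None
    return AltEntry(f[1], f[2], int(f[3]), f[5], new_name, int(f[-1]))


def is_acceptable(filename, md5, size, tree_md5, tree_size, entries):
    """Accept the tree's own md5/size, or a recorded alternate for it."""
    if (md5, size) == (tree_md5, tree_size):
        return True
    return any(
        e.orig_file == filename
        and (e.orig_md5, e.orig_size) == (tree_md5, tree_size)
        and (e.alt_md5, e.alt_size) == (md5, size)
        for e in entries
    )


def prune(entries, distdir):
    """Drop pairs whose file no longer exists in distfiles."""
    return [
        e for e in entries
        if os.path.exists(os.path.join(distdir, e.new_name or e.orig_file))
    ]


db = [parse_alt_line(
    "MD5 984146931906a7d53300b29f58f6a899 OOo_1.0.3_source.tar.bz2 165475319 "
    "MD5 0733dd85ed44d88d1eabed704d579721 165444187"
)]
print(is_acceptable("OOo_1.0.3_source.tar.bz2",
                    "0733dd85ed44d88d1eabed704d579721", 165444187,
                    "984146931906a7d53300b29f58f6a899", 165475319, db))
```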
Where to store the database is debatable; /etc/portage and /var/cache/edb are
definite options.

The reason for allowing an optional new-name is that it provides needed
functionality should anyone attempt to extend portage to let clients change
the compression used for a source (e.g., recompress all gzip files as bzip2).
Granted, no such code exists and no such attempt has been made, but nothing
is lost by leaving the option open should the request/attempt be made.

A potential gotcha of adding this support is that in environments where the
distfiles directory is shared out to multiple systems, this db must be shared
as well.

Distfile Mirror Additions
-------------------------

One issue of contention is where these files will actually be stored. As of
the writing of this GLEP, a full distfiles mirror is roughly 40 gb; a rough
estimate by the author puts the space required to hold patches for each
version at around 4gb in total. Note that this isn't even remotely a hard
figure yet, and a better figure is currently being worked out.

Regardless of the exact space figure, finding a place to store the patches
will be problematic. Expanding the required mirror space (essentially just
swallowing the patches' storage requirement) is unlikely, since that was one
of the main arguments against the now defunct glep9 attempt [3]_. A couple of
ideas that have been put forth to handle the additional space requirements
are as follows:

1) Identification of mirrors willing to handle the extra space requirements;
   essentially, create an additional patch mirror tier.

2) Mirroring only a patch for certain package versions, rather than the full
   source. Using kdelibs-3.1.5 as an example, only the patch would be
   mirrored (rather than the full 10.54 MB source). The downside to this
   approach is that a user downloading kdelibs for the first time would
   either need to pull it from the original SRC_URI (placing the burden onto
   the upstream mirror), or pull the 3.1.4 version and the patch, pulling 63k
   more than if they had just pulled the full version. The kdelibs
   3.1.4/3.1.5 example is something of an optimal case; not all versions
   will have such minuscule patches.

3) A variation on the idea above: mirror only the patch for the oldest
   version(s) of a package. E.g., kdelibs currently has versions 3.05, 3.1.5,
   3.2.0, and 3.2.1; the mirrors would carry only a patch for 3.05, not the
   full source (think RESTRICT="fetch"). One plus is that patches to
   downgrade in version are smaller than the patches to upgrade in version;
   there are exceptions to this, but they're hard to find. The major
   downsides to this approach are that A) a user would have to sync up to
   get the patch lists for that version, and B) a set of patches going
   backwards in version would have to be created (see
   `Binary patches vs GNUDiff patches`_).

Of the options listed above, the first is the easiest, although the second
could be made to work. Feedback and any possible alternatives would be
greatly appreciated.

Patch Creation
--------------

Maintenance of patch lists and the actual patch creation ought to be managed
by a high-level script. Essentially, a dev says "I want a patch between this
version and that version: make it so", and the script churns away,
creating/updating the patch list and generating the patch locally. The
utility then uploads the new patch to /space/distfiles-local on dev.gentoo.org
(except when the patch was not generated locally), and repoman is used to
commit the updated patch list.

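The patch list half of such a script could look roughly like the sketch
below, which builds one entry in the format from `Additions to the tree`_.
It assumes a bzip2-compressed target and takes file contents as bytes to
stay short; generating the patch itself, uploading, and committing are left
out, and ``make_patch_entry`` is a hypothetical name.

```python
import bz2
import hashlib


def record(kind: str, name: str, data: bytes) -> str:
    """Format one 'type md5 name size' record."""
    return f"{kind} {hashlib.md5(data).hexdigest()} {name} {len(data)}"


def make_patch_entry(patch_name, patch_data, ref_name, ref_data,
                     new_name, new_data):
    """Build a patch list line; new_data is the bzip2-compressed target."""
    # The UMD5 record covers the target *after* decompression.
    uncompressed_new = bz2.decompress(new_data)
    return " ".join([
        record("MD5", patch_name, patch_data),
        record("MD5", ref_name, ref_data),
        record("UMD5", new_name, uncompressed_new),
    ])


# Made-up stand-ins for real distfiles and a real patch.
old_tarball = bz2.compress(b"old release contents\n" * 100)
new_tarball = bz2.compress(b"new release contents\n" * 100)
patch_blob = bz2.compress(b"opaque binary patch data")

line = make_patch_entry("foo-1.0-1.1.patch.bz2", patch_blob,
                        "foo-1.0.tar.bz2", old_tarball,
                        "foo-1.1.tar.bz2", new_tarball)
print(line)
```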
What would be preferable (although possibly wishful thinking) is if hardware
could be co-opted for automatic patch generation rather than forcing it upon
the devs, something akin to how files are pulled onto the mirrors
automatically for new ebuilds.

The initial bulk of patches will be generated by the author, to ease the
transition and offer patches for people to test out.

Backwards Compatibility
=======================

As noted in `The Proposed Solution`_, a system that uses patching and shares
out its distfiles must also share out its alternate md5 db. Any system that
uses the distfiles share must support the alternate md5 db as well. If this
is considered enough of an issue, it is conceivable to place reconstructed
sources with an alternate md5 into a subdirectory of distdir; portage only
looks within distdir itself and does not descend into subdirectories.

Also note that `Distfile Mirror Additions`_ may add further backwards
compatibility issues, depending on what solution is accepted.

Reference Implementation
========================

TODO

References
==========

.. [1] http://dev.gentoo.org/~ferringb/patches/kdelibs-3.1.4-3.1.5.{patch,diff}.bz2.
.. [2] kdelibs-3.1.4-3.1.5.patch.bz2, switching format patch, created via
   diffball-0.4_pre4 (diffball is available at
   http://sourceforge.net/projects/diffball).
   Bzip2 -9 compressed, the patch is 75,687 bytes; uncompressed it is
   337,649 bytes. The patch is available at
   http://dev.gentoo.org/~ferringb/kdelibs-3.1.4-3.1.5.patch.bz2 for those
   curious.
.. [3] Glep9, 'Gentoo Package Update System'
   (http://glep.gentoo.org/glep-0009.html)

Copyright
=========

This document has been placed in the public domain.