ZIP (file format)

From Wikipedia, the free encyclopedia

Jump to: navigation, search
ZIP

A zip archive icon from the The Unarchiver's icon set.
Filename extension .zip
Internet media type application/zip
Uniform Type Identifier com.pkware.zip-archive
Magic number PK\003\004 or PK\005\006 (empty archive) (unless a bootstrap script is present)
Developed by Phil Katz
Type of format Data compression

The ZIP file format is a data compression and archive format. A ZIP file contains one or more files that have been compressed to reduce file size, or stored as-is. The ZIP file format permits a number of compression algorithms, but as of 2009, only Deflate is widely used and supported.

The format was originally created in 1989 by Phil Katz for PKZIP, and evolved from the previous ARC compression format by Thom Henderson. However, many software utilities other than PKZIP itself are now available to create, modify, or open (unzip, decompress) ZIP files, notably WinZip, BOMArchiveHelper, StuffIt, KGB Archiver, PicoZip, Info-ZIP, WinRAR, IZArc, 7-Zip, ALZip, TUGZip, PeaZip, ZipGenius, and Universal Extractor. Microsoft has included built-in ZIP support (under the name "compressed folders") in versions of its Windows operating system since 1998. Apple has included built-in ZIP support in Mac OS X 10.3 and Mac OS X 10.4 via the BOMArchiveHelper utility, now called Archive Utility in Mac OS X 10.5 . The zip, zipcloak, zipnote, zipsplit tools are used widely in unix-like systems.

ZIP files generally use the file extensions ".zip" or ".ZIP" and the MIME media type application/zip. Some software uses the ZIP file format as a wrapper for a large number of small items in a specific structure; when this is done a different file extension is usually used. Examples of this usage are Java JAR files, Python .egg files, SilverLight .xap files, id Software .pk3/.pk4 files, package files for StepMania and Winamp/Windows Media Player skins, XPInstall, as well as OpenDocument and Office Open XML office formats. Both OpenDocument and Office Open XML formats use the JAR file format internally, so files can be easily uncompressed and compressed using tools for ZIP files. Google Earth makes use of KMZ files, which are just KML files in ZIP format. Mozilla Firefox Add-ons are zip files with extension "xpi". Nokia and Sony Ericsson mobile phone themes are zipped with extension "nth" and "thm", respectively.

Contents

[edit] History

[edit] Early history

During the mid-1980s, System Enhancement Associates, a small company run by Thom Henderson, created a file archiving format called ARC, and a corresponding archiver (also called ARC) that could compress and decompress files into this format.[1] This program was released as shareware for a number of platforms, with the source code included. The file format quickly became a de facto standard.[1] Phil Katz released a file compatible software package on the IBM Intel DOS platform, known as PKXARC. It used hand-optimized 8088 assembly language and was considerably faster than SEA's original cross-platform implementation in C.

The competition from Katz did not please SEA, who sued Katz for trademark and copyright infringement, as it alleged that Katz had plagiarized sections of the code. Katz lost the lawsuit and was forced to pay $62,500 to SEA to cover their legal fees. It was found during the court case[citation needed] that Katz had used SEA's ARC source code for the majority of the application but had only made code optimizations to increase speed. Primarily he changed the word length used by the algorithm from 12 bits to 13 bits resulting in a higher compression for typical binary files. As a result of the lawsuit, Katz changed the names of his utilities to PKPAK and PKUNPAK.

Katz then went on to create his own file format, which is known worldwide now as the ZIP format (commonly called a "ZIP file"). The ZIP format was more resistant to data loss than the ARC format because of redundant catalog storage. It also was more flexible than ARC, providing room for additional optional compression algorithms and future expansion. Along with the new format, PKZIP included at least one compression algorithm more efficient than any supported by ARC. Once PKZIP was released, many users abandoned ARC because of its slower speed and less effective compression, and because SEA alienated many by seeming to suddenly assert proprietary legal rights over the ARC file format after it had become widely used among the on-line community (similar in this respect to the later GIF patents controversy).

Katz publicly released technical documentation on the ZIP file format making it an open format, along with the first version of his PKZIP archiver, in January 1989. Originally only bundled with registered versions of PKZIP, the APPNOTE.TXT documentation file, titled .ZIP File Format Specification, was later available on the PKWARE site.

The name zip (meaning speed) was suggested by Katz's friend, Robert Mahoney. They wanted to imply that their product would be faster than ARC and other compression formats of the time.

[edit] Beyond the command line

In the mid 1990s, as more new computers included graphical user interfaces, fewer users were comfortable with the command-line operation of PKZIP. Seeing an opportunity, shareware authors began pitching compression and archival programs with graphical user interfaces, with many of these using the ZIP format; WinZip was among the most popular. PKWARE also offered a graphical version of PKZIP. These programs were easier to learn than the older command-line equivalents, but users still had to learn a specialized tool with its own interface for file archival and compression.

In the late 1990s, various file managing software started integrating support for the ZIP format into their user interface. Even earlier, Norton Commander and its clones like Volkov Commander in DOS had started that trend, and that remains the norm for the "Commander-like" or orthodox file managers like Midnight Commander (for Linux and UNIX-like systems) and Total Commander (previously Windows Commander; for Windows). The KDE file manager (kfm) supported the ZIP format very early; ZIP support was also first added to Windows Explorer with the Plus! enhancement package in Windows 98 and later included in Windows Me, Windows XP, Windows Vista, and now Windows 7; ZIP format support is also built in the Mac OS Finder (as of Mac OS X, via the BOMArchiveHelper utility), the Nautilus file manager used by GNOME, and the Konqueror file manager of newer versions of KDE. By 2002, all major desktop environments included ZIP file support in their file managers: a ZIP file is typically presented as a directory or folder, so that files are copied into and out of it in the same manner as any other folder and the compression is handled in a way largely transparent to the user. This has eliminated the need to learn a specialized tool and interface for file archival and compression.

As well, ZIP files can be processed on iPhones or mobile devices based on Windows Mobile.

The ZIP format is ubiquitous.

[edit] Confusion among formats

There are numerous standards and formats dealing with compression, and people sometimes confuse or conflate them. For example, ZIP is distinct from GZIP; the former is a standard owned by the PKWare Company, and the latter is defined in an IETF RFC (1952). Both ZIP and GZIP primarily use the DEFLATE algorithm for compression. Likewise, the ZLIB format (IETF RFC 1950) also uses the DEFLATE compression algorithm, but specifies different headers for error and consistency checking.

[edit] Version history

The .ZIP File Format Specification has its own version number, which does not necessarily correspond to the version numbers for the PKZIP tool, especially with PKZIP 6 or later. At various times, PKWARE adds preliminary features that allows PKZIP products to extract archives using advanced features, but PKZIP products that create such archives won't be available until the next major release. Other companies or organizations support the PKWARE specifications at their own pace.

A summary of key advances in various versions of the PKWARE spec:

  • 2.0: File entries can be compressed with DEFLATE.
  • 4.5: Documented 64-bit ZIP format.
  • 5.0: DES, 3DES, RC2, RC4 supported for encryption
  • 5.2: RC2-64 supported for Encryption.
  • 6.1: Documented certificate storage.
  • 6.2.0: Documented Central Directory Encryption.
  • 6.3.0: Documented Unicode (UTF-8) filename storage. Expanded list of supported hash, compression, encryption algorithms.
  • 6.3.1: Corrected standard hash values for SHA-256/384/512.
  • 6.3.2: Documented compression method 97 (WavPack).

[edit] Technical information

ZIP is a simple archive format that compresses every file separately. Compressing files separately allows for individual files to be retrieved without reading through other data; in theory, it may allow better compression by using different algorithms for different files. A caveat to this is that archives containing a large number of small files end up significantly larger than if they were compressed as a single file (the classic example of the latter is the common tar.gz archive which consists of a TAR archive compressed using gzip).

The specification for ZIP indicates that files can be stored either uncompressed or using a variety of compression algorithms, but ZIP is generally used with Katz's DEFLATE algorithm, except when files being added are already compressed or are resistant to compression, in which case the file data is simply STORED uncompressed.

ZIP supports a simple password-based symmetric encryption system which is documented in the ZIP spec, and known to be seriously flawed. In particular it is vulnerable to known-plaintext attacks which are in some cases made worse by poor implementations of random number generators.[2] The ZIP spec also supports spreading archives across multiple filesystem files. Originally intended for storage of large zip files across multiple 1,44mb floppy disks, this feature is now used for sending zip archives in parts over email, or over other transports or removable media).

New features including new compression and encryption (e.g. AES) methods have been documented to .ZIP File Format Specification since version 5.2. A WinZip-developed AES-based standard is used also by 7-Zip, XCeed, and DotNetZip, but some vendors use other formats.[3] PKWARE SecureZIP also supports DC2, DC4, DES, 3DES encryption methods, Digital Certificate-based encryption and authentication (X.509), and archive header encryption.[4]

The original ZIP format had a 4.2gb limit on various things (uncompressed size of a file, compressed size of a file and total size of the archive), as well as a limit of 65535 entries in a zip archive. In version 4.5 of the specification (which is not the same as v4.5 of any particular tool), PKWARE introduced the "ZIP64" format extensions to get around these limitations. Zip64 support is emerging. For example, the File Explorer in Windows XP does not support ZIP64, but the Explorer in Windows Vista does. Likewise - some libraries, such as IO::Compress::Zip in Perl, have new support for ZIP64, while others, such as Java's built-in java.util.zip, still lack it.

The FAT filesystem of DOS only has a timestamp resolution of two seconds; ZIP file records mimic this. As a result, the built-in timestamp resolution of files in a ZIP archive is only two seconds, though extra fields can be used to store more accurate timestamps.

Since September 2007, the ZIP specification (APPNOTE.TXT) contains a provision to store file names using UTF-8, finally adding Unicode compatibility to ZIP.

Not all of the zip features are implemented by the all the various libraries and zip toolkits.

[edit] The structure of a ZIP file

The ZIP file contents are files and directories which are stored in arbitrary order. The location of a file is indicated in the so called central directory which is located at the end of the ZIP file. The files and directories are represented by file entries.

Each file entry is introduced by a local header with information about the file such as the comment, file size and file name, followed by optional "Extra" data fields, and then the possibly compressed, possibly encrypted file data. The "Extra" data fields are the key to the extensibility of the ZIP format. It is the "Extra" fields that are exploited to support ZIP64 formats, WinZip-compatible AES encryption, and NTFS file timestamps. In theory there are many other extensions possible via this coded "extra" field.

The central directory consists of file headers holding, among other metadata, the file names and the relative offset in the archive of the local headers for each file entry.

Each file entry is marked by a specific 4-byte "signature"; each entry in the central directory is likewise marked with a different particular 4-byte signature. ZIP file parsers typically look for the appropriate signatures when parsing a ZIP file.

Due to the fact that the order of the file entries in the directory need not conform to the order of file entries in the archive, the format is non-sequential.

There is no BOF or EOF marker in the ZIP spec. Instead, ZIP tools scan for the signatures of the various fields.

Image:ZIPformat.jpg

[edit] Combining ZIP with other file formats

The ZIP file format allows for a comment containing any data to occur at the end of the file after the central directory.[5] Also, because the central directory specifies the offset of each file in the archive with respect to the start, it is possible in practice for the first file entry to start at an offset other than zero.

This allows arbitrary data to occur in the file both before and after the ZIP archive data, and for the archive to still be read by a ZIP application. A side-effect of this is that it is possible to author a file that is both a working ZIP archive and another format, provided that the other format tolerates arbitrary data at its end, beginning, or middle. Self-extracting archives (SFX), of the form supported by WinZip and DotNetZip, take advantage of this - they are .exe files that conform to the PKZIP AppNote.txt specification and can be read by compliant zip tools or libraries.

This property of the ZIP format, and of the JAR format which is a variant of ZIP, can be exploited to hide harmful Java classes inside a seemingly harmless file, such as a GIF image uploaded to the web. This so-called GIFAR exploit has been demonstrated as an effective attack against web applications such as Facebook.[6]

[edit] Implementing a ZIP application

There are numerous ZIP tools available, and numerous ZIP libraries for various programming environments. Some of the libraries are commercial, some are not. Some are open source, some are not. WinZip is perhaps the most popular and famous ZIP tool - it runs primarily on Windows and is a user tool for creating or extracing ZIP files. WinRAR, IZarc, Info-zip, 7-zip are other tools, available on various platforms. Some of those tools have library or programmatic interfaces.

There are some useful development libraries which are available as open source contributions such as the GNU gzip project and Info-ZIP. For Java, there are a few options: Java Platform, Standard Edition contains the package "java.util.zip" to handle standard zip files; the Zip64File library specifically supports large files (larger than 4GB) and treats ZIP files using random access; and the Apache Ant tool contains a more complete implementation released under the Apache Software License.

For .NET applications, there is a no-cost library called DotNetZip [7] available in source and binary form under the Microsoft Public License [8]. It does passwords for symmetric ZIP encryption, Unicode, ZIP64, and WinZip-compatible AES encryption. The Microsoft .NET 3.5 runtime library includes a class System.IO.Packaging.Package[9] that supports the ZIP format, but it is primarily designed for Microsoft's document formats (xlsx, pptx, docx, xps), and is somewhat unnatural to use for generic zip files.

The Info-ZIP implementations of the ZIP format adds support for Unix filesystem features, such as user and group IDs, file permissions, and support for symbolic links. The Apache Ant implementation is aware of these to the extent that it can create files with predefined Unix permissions. The Info-ZIP implementations also know how to use the error correction capabilities built into the ZIP compression format. Some programs (such as IZArc) do not and will choke on a file that has errors.

The Info-ZIP Windows tools also support NTFS filesystem permissions, and will make an attempt to translate from NTFS permissions to Unix permissions or vice-versa when extracting files. This can result in potentially unintended combinations, e.g. .exe files being created on NTFS volumes with executable permission denied.

[edit] Strong encryption controversy

When WinZip 9.0 public beta was released in 2003, WinZip introduced its own AES-256 encryption, using a different file format, along with the documentation for the new specification. [10] The encryption standards themselves were not proprietary, but PKWARE had not updated APPNOTE.TXT to include Strong Encryption Specification (SES) since 2001, which had been used by PKZIP versions 5.0 and 6.0. WinZip technical consultant Kevin Kearney and StuffIt product manager Mathew Covington accused PKWARE of withholding SES, but PKZIP chief technology officer Jim Peterson claimed that Certificate-based encryption was still incomplete. However, the latest publicly available APPNOTE.TXT at the time was version 4.5 (available on PKWARE's FTP site), which not only omitted SES, but also omitted Deflate64, DCL Implode, BZip2 compression methods used by .ZIP files created by contemporary PKZIP products.

To overcome this shortcoming, contemporary products such as PentaZip 'implemented' strong ZIP encryption by encrypting ZIP archives into a different file format.[11]

In another controversial move, PKWare applied for a patent in 2003-07-16 describing a method for combining .ZIP and strong encryption to create a secure .ZIP file.[12]

In the end, PKWARE and WinZip agreed to support each other's products. On 2004-01-21, PKWARE announced the support of WinZip-based AES compression format.[13] In later version of WinZip beta, it is able to support SES-based ZIP files.[14] PKWARE eventually released version 5.2 of .ZIP File Format Specification to public, which documented SES.

[edit] See also

[edit] References

[edit] External links

Personal tools