Difference between pages "Metadata" and "Gzip"

From ForensicsWiki
(Difference between pages)
 
(File format)
 
Line 1: Line 1:
'''Metadata''' is data about data. Metadata plays a number of important roles in [[computer forensics]]:
+
{{expand}}
* It can provide corroborating information about the document data itself.
+
* It can reveal information that someone tried to hide, delete, or obscure.
+
* It can be used to automatically correlate documents from different sources.
+
  
Since metadata is fundamentally data, it suffers from the same data quality and pedigree issues as any other form of data. Nevertheless, because metadata isn't generally visible unless you use a special tool, more skill is required to alter or otherwise manipulate it.
+
== File format ==
 +
The gzip file (.gz) format consists of:
 +
* a file header
 +
* optional headers
 +
** extra fields
 +
** original file name
 +
** comment
 +
** header checksum
 +
* a body, containing a DEFLATE-compressed payload
 +
* a file footer
  
==Kinds of Metadata==
+
The gzip format stores multi-byte values in little-endian byte order.
Some kinds of metadata that are interesting in computer forensics:
+
* [[File system]] metadata (e.g. [[MAC times]], [[access control lists]], etc.)
+
* Digital image metadata. Although information such as the image size and number of colors are technically metadata, [[JPEG]] and other file formats store additional data about the photo or the device that acquired it.
+
* Document metadata, such as the creator of a document, its last print time, etc.
+
  
==File types that support metadata and extraction tools==
+
{| class="wikitable"
 +
! align="left"| Characteristics
 +
! Description
 +
|-
 +
| Byte order
 +
| little-endian
 +
|-
 +
| Date and time values
 +
| POSIX timestamp in UTC
 +
|-
 +
| Character string
 +
| ISO 8859-1 (LATIN-1)
 +
|}
  
Below are some common data and metadata formats, the files in which they are found, and a collection of tools that can be used to extract information.
+
=== File header ===
 +
The file header is 10 bytes in size and contains:
 +
{| class="wikitable"
 +
! align="left"| Offset
 +
! Size
 +
! Value
 +
! Description
 +
|-
 +
| 0
 +
| 2
 +
| 0x1f 0x8b
 +
| Signature (or identification byte 1 and 2)
 +
|-
 +
| 2
 +
| 1
 +
|
 +
| Compression Method
 +
|-
 +
| 3
 +
| 1
 +
|
 +
| Flags
 +
|-
 +
| 4
 +
| 4
 +
|
 +
| Last modification time <br> Contains a POSIX timestamp.
 +
|-
 +
| 8
 +
| 1
 +
|
 +
| Extra flags
 +
|-
 +
| 9
 +
| 1
 +
|
 +
| Operating system <br> Value that indicates on which operating system the gzip file was created.
 +
|}
  
; [[EXIF]] ([[JPEG]] and [[TIFF]] image files; Music Files)
+
==== Compression method ====
: The [[Exchangeable Image File]] format describes a format for a block of data that can be embedded into JPEG and TIFF image files, as well as [[RIFF WAVE]] audio files. Information includes date and time information, camera settings, location information, textual descriptions, and copyright information.
+
:* [http://pel.sourceforge.net/ PEL: PHP Exif Library]
+
:* [http://libexif.sourceforge.net/ LibExif] (C)
+
:* [http://www.drewnoakes.com/code/exif/ Metadata extraction in Java]
+
  
; [[ID3]] ([[MP3]] files)
+
{| class="wikitable"
: Implemented as a small block of data stored at the end of MP3 files. [[ID3v1]] is a 128-byte block in a specified format allowing 30 bytes for song, artist and album, 4 bytes for year, 30 bytes for comment, and 1 byte for genre. [[ID3v1.1]] adds a track number. [[ID3v2]] is a general container structure. For more information, see [http://www.id3.org/].
+
! align="left"| Value
:* [http://id3lib.sourceforge.net/ id3lib], a widely-used open source C/C++ ID3 implementation.
+
! Identifier
:* [http://www.vdheide.de/projects.html Java library MP3]
+
! Description
:* [http://search.cpan.org/dist/MP3-Info/ MP3::Info] (Perl)
+
|-
:* [http://search.cpan.org/dist/MPEG-ID3v2Tag/ MPEG::ID3v2Tag] (Perl)
+
| 0 - 7
 +
|
 +
| Reserved
 +
|-
 +
| 8
 +
| "deflate"
 +
| DEFLATE compressed data
 +
|}
  
; [[Microsoft]] [[OLE Compound File]]
+
==== Flags ====
: Microsoft Office document files contain a huge amount of metadata. They are created as OLE Compound Files and mainly stored in the so called property set streams. Here are some tools for processing them:
+
:* [http://jakarta.apache.org/poi/index.html Jakarta POI] Open Source implementation in Java.
+
:* [http://www.payneconsulting.com/ Payne Consulting] Metadata Analysis and cleanup.
+
:* [http://www.inforenz.com/software/forager.html Inforenz Forager] Inforenz Forager
+
  
; [[TIFF]]
+
{| class="wikitable"
: The [[Tagged Image File Format]] allows one or more images to be bundled in a single file. Multiple [[compression]] formats are supported. [[EXIF]] files can be stored inside TIFFs.
+
! align="left"| Value
:* [http://www.remotesensing.org/libtiff/ LibTIFF]
+
! Identifier
:* [http://www.awaresystems.be/imaging/tiff/faq.html TIFF FAQ]
+
! Description
 +
|-
 +
| 0x01
 +
| FTEXT
 +
| If set, the uncompressed data should be treated as text rather than binary data. <br> The flag is only a hint for end-of-line conversion of cross-platform text files; it does not enforce it.
 +
|-
 +
| 0x02
 +
| FHCRC
 +
| The file contains a header checksum (CRC-16)
 +
|-
 +
| 0x04
 +
| FEXTRA
 +
| The file contains extra fields
 +
|-
 +
| 0x08
 +
| FNAME
 +
| The file contains an original file name string
 +
|-
 +
| 0x10
 +
| FCOMMENT
 +
| The file contains a comment
 +
|-
 +
| 0x20
 +
|
 +
| Reserved
 +
|-
 +
| 0x40
 +
|
 +
| Reserved
 +
|-
 +
| 0x80
 +
|
 +
| Reserved
 +
|}
  
=External links=
+
<b>Note:</b> The FHCRC bit was never set by versions of gzip up to 1.2.4, even though it was documented with a different meaning in gzip 1.2.4.
* [http://en.wikipedia.org/wiki/Metadata Wikipedia: Metadata]
+
 
* [http://theses.nps.navy.mil/08Jun_Migletz.pdf Automated Metadata Extraction], James Migletz, Master's Thesis, Naval Postgraduate School, June 2008
+
==== Extra flags ====
 +
If compression method is 8 the following extra flags can be defined:
 +
{| class="wikitable"
 +
! align="left"| Value
 +
! Identifier
 +
! Description
 +
|-
 +
| 0x02
 +
|
 +
| compressor used maximum compression, slowest algorithm
 +
|-
 +
| 0x04
 +
|
 +
| compressor used fastest algorithm
 +
|}
 +
 
 +
==== Operating System ====
 +
{| class="wikitable"
 +
! align="left"| Value
 +
! Identifier
 +
! Description
 +
|-
 +
| 0
 +
|
 +
| FAT filesystem (MS-DOS, OS/2, NT/Win32)
 +
|-
 +
| 1
 +
|
 +
| Amiga
 +
|-
 +
| 2
 +
|
 +
| VMS (or OpenVMS)
 +
|-
 +
| 3
 +
|
 +
| Unix
 +
|-
 +
| 4
 +
|
 +
| VM/CMS
 +
|-
 +
| 5
 +
|
 +
| Atari TOS
 +
|-
 +
| 6
 +
|
 +
| HPFS filesystem (OS/2, NT)
 +
|-
 +
| 7
 +
|
 +
| Macintosh
 +
|-
 +
| 8
 +
|
 +
| Z-System
 +
|-
 +
| 9
 +
|
 +
| CP/M
 +
|-
 +
| 10
 +
|
 +
| TOPS-20
 +
|-
 +
| 11
 +
|
 +
| NTFS filesystem (NT)
 +
|-
 +
| 12
 +
|
 +
| QDOS
 +
|-
 +
| 13
 +
|
 +
| Acorn RISCOS
 +
|-
 +
| 255
 +
|
 +
| unknown
 +
|}
 +
 
 +
=== Optional headers ===
 +
==== Extra fields ====
 +
<b>TODO: add description</b>
 +
 
 +
The extra field is variable in size and contains:
 +
{| class="wikitable"
 +
! align="left"| Offset
 +
! Size
 +
! Value
 +
! Description
 +
|-
 +
| 0
 +
| 2
 +
|
 +
| Extra field data size <br> Value in bytes.
 +
|-
 +
| 2
 +
| ...
 +
|
 +
| Extra field data
 +
|}
 +
 
 +
==== Original file name ====
 +
This is the original name of the file being compressed, with any directory components removed, and, if the file being compressed is on a file system with case insensitive names, forced to lower case.
 +
 
 +
Contains a zero-terminated ISO 8859-1 (LATIN-1) string.
 +
 
 +
==== Comment ====
 +
Contains a zero-terminated ISO 8859-1 (LATIN-1) string. Line breaks should be denoted by a single line feed character.
 +
 
 +
==== Header checksum ====
 +
The header checksum contains a CRC-16 that consists of the two least significant bytes of the CRC-32 of all bytes of the gzip header up to but not including the CRC-16.
 +
 
 +
=== File footer ===
 +
The file footer is 8 bytes in size and contains:
 +
{| class="wikitable"
 +
! align="left"| Offset
 +
! Size
 +
! Value
 +
! Description
 +
|-
 +
| 0
 +
| 4
 +
|
 +
| Checksum (CRC-32) of the uncompressed data
 +
|-
 +
| 4
 +
| 4
 +
|
 +
| Uncompressed data size <br> Value in bytes, modulo 2^32.
 +
|}
 +
 
 +
== See Also ==
 +
* [[bz2 file]]
 +
 
 +
== External Links ==
 +
 
 +
* [http://www.gzip.org/format.txt The gzip file format], by the [http://www.gzip.org/ gzip project]
 +
* [http://www.gzip.org/algorithm.txt The gzip compression algorithm], by the [http://www.gzip.org/ gzip project]
 +
* [http://tools.ietf.org/html/rfc1952 RFC1952: GZIP file format specification version 4.3], by [[IETF]]
 +
* [http://en.wikipedia.org/wiki/Gzip Wikipedia: gzip]
 +
 
 +
[[Category:File Formats]]

Revision as of 23:22, 28 November 2013


File format

The gzip file (.gz) format consists of:

  • a file header
  • optional headers
    • extra fields
    • original file name
    • comment
    • header checksum
  • a body, containing a DEFLATE-compressed payload
  • a file footer

The gzip format stores multi-byte values in little-endian byte order.

{| class="wikitable"
! Characteristics !! Description
|-
| Byte order || little-endian
|-
| Date and time values || POSIX timestamp in UTC
|-
| Character string || ISO 8859-1 (LATIN-1)
|}

File header

The file header is 10 bytes in size and contains:

{| class="wikitable"
! Offset !! Size !! Value !! Description
|-
| 0 || 2 || 0x1f 0x8b || Signature (or identification byte 1 and 2)
|-
| 2 || 1 || || Compression Method
|-
| 3 || 1 || || Flags
|-
| 4 || 4 || || Last modification time <br> Contains a POSIX timestamp.
|-
| 8 || 1 || || Extra flags
|-
| 9 || 1 || || Operating system <br> Value that indicates on which operating system the gzip file was created.
|}
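
The fixed header layout above can be decoded in a few lines. The following is a minimal Python sketch, not a reference implementation: the function name parse_gzip_header and the returned dictionary keys are illustrative, and treating a timestamp of 0 as "not available" is an assumption taken from RFC 1952 rather than from the table.

```python
import gzip
import struct
from datetime import datetime, timezone

def parse_gzip_header(data: bytes) -> dict:
    """Decode the fixed 10-byte gzip header (hypothetical helper)."""
    if len(data) < 10:
        raise ValueError("need at least 10 bytes")
    # '<' = little-endian, no padding: 2 signature bytes, method, flags,
    # 32-bit modification time, extra flags, operating system
    id1, id2, method, flags, mtime, xfl, os_id = struct.unpack("<BBBBIBB", data[:10])
    if (id1, id2) != (0x1F, 0x8B):
        raise ValueError("not a gzip signature")
    return {
        "method": method,
        "flags": flags,
        # Assumption (RFC 1952): a value of 0 means no timestamp is available
        "mtime": datetime.fromtimestamp(mtime, tz=timezone.utc) if mtime else None,
        "extra_flags": xfl,
        "os": os_id,
    }

# Build a small sample in memory with a known modification time
sample = gzip.compress(b"hello", mtime=1385680920)
header = parse_gzip_header(sample)
print(header["method"], header["mtime"])
```

The sample is generated in memory so that no file on disk is needed; on real evidence the first 10 bytes of the file would be passed instead.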

Compression method

{| class="wikitable"
! Value !! Identifier !! Description
|-
| 0 - 7 || || Reserved
|-
| 8 || "deflate" || DEFLATE compressed data
|}

Flags

{| class="wikitable"
! Value !! Identifier !! Description
|-
| 0x01 || FTEXT || If set, the uncompressed data should be treated as text rather than binary data. <br> The flag is only a hint for end-of-line conversion of cross-platform text files; it does not enforce it.
|-
| 0x02 || FHCRC || The file contains a header checksum (CRC-16)
|-
| 0x04 || FEXTRA || The file contains extra fields
|-
| 0x08 || FNAME || The file contains an original file name string
|-
| 0x10 || FCOMMENT || The file contains a comment
|-
| 0x20 || || Reserved
|-
| 0x40 || || Reserved
|-
| 0x80 || || Reserved
|}

Note: The FHCRC bit was never set by versions of gzip up to 1.2.4, even though it was documented with a different meaning in gzip 1.2.4.
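
The flags byte is a bit field, so several identifiers can be set at once. As a rough illustration, the sketch below maps a flags value to the identifiers from the table; the names FLAG_NAMES, RESERVED_MASK and decode_flags are made up for this example.

```python
# Bit values and identifiers as listed in the flags table
FLAG_NAMES = {
    0x01: "FTEXT",
    0x02: "FHCRC",
    0x04: "FEXTRA",
    0x08: "FNAME",
    0x10: "FCOMMENT",
}
RESERVED_MASK = 0xE0  # bits 0x20, 0x40 and 0x80

def decode_flags(flags: int) -> list:
    """Return the identifiers of all bits set in the flags byte."""
    names = [name for bit, name in FLAG_NAMES.items() if flags & bit]
    if flags & RESERVED_MASK:
        # Reserved bits are reported rather than silently ignored
        names.append("RESERVED")
    return names

print(decode_flags(0x08))  # ['FNAME']
print(decode_flags(0x1A))  # ['FHCRC', 'FNAME', 'FCOMMENT']
```

Reporting reserved bits separately is a design choice for this sketch: a set reserved bit may point to a damaged or non-standard file and is usually worth a closer look.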

Extra flags

If compression method is 8 the following extra flags can be defined:

{| class="wikitable"
! Value !! Identifier !! Description
|-
| 0x02 || || compressor used maximum compression, slowest algorithm
|-
| 0x04 || || compressor used fastest algorithm
|}

Operating System

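{| class="wikitable"
! Value !! Identifier !! Description
|-
| 0 || || FAT filesystem (MS-DOS, OS/2, NT/Win32)
|-
| 1 || || Amiga
|-
| 2 || || VMS (or OpenVMS)
|-
| 3 || || Unix
|-
| 4 || || VM/CMS
|-
| 5 || || Atari TOS
|-
| 6 || || HPFS filesystem (OS/2, NT)
|-
| 7 || || Macintosh
|-
| 8 || || Z-System
|-
| 9 || || CP/M
|-
| 10 || || TOPS-20
|-
| 11 || || NTFS filesystem (NT)
|-
| 12 || || QDOS
|-
| 13 || || Acorn RISCOS
|-
| 255 || || unknown
|}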

Optional headers

Extra fields

TODO: add description

The extra field is variable in size and contains:

{| class="wikitable"
! Offset !! Size !! Value !! Description
|-
| 0 || 2 || || Extra field data size <br> Value in bytes.
|-
| 2 || ... || || Extra field data
|}
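
Because the extra field carries its own size, it can be skipped or extracted without understanding its contents. The sketch below assumes the extra field, when present, starts directly after the 10-byte file header (the order in which the optional headers are listed); read_extra_field and the sample bytes are hypothetical.

```python
import struct

def read_extra_field(buf: bytes, offset: int = 10):
    """Return (extra field data, offset of the next header part)."""
    # 2-byte little-endian size, followed by that many bytes of data
    (xlen,) = struct.unpack_from("<H", buf, offset)
    start = offset + 2
    end = start + xlen
    if end > len(buf):
        raise ValueError("extra field runs past the end of the buffer")
    return buf[start:end], end

# Hand-built header with flags = 0x04 (FEXTRA) and a 4-byte extra field
fake = bytes([0x1F, 0x8B, 8, 0x04, 0, 0, 0, 0, 0, 3]) + b"\x04\x00ABCD" + b"rest"
data, next_offset = read_extra_field(fake)
print(data, next_offset)
```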

Original file name

This is the original name of the file being compressed, with any directory components removed, and, if the file being compressed is on a file system with case insensitive names, forced to lower case.

Contains a zero-terminated ISO 8859-1 (LATIN-1) string.

Comment

Contains a zero-terminated ISO 8859-1 (LATIN-1) string. Line breaks should be denoted by a single line feed character.
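
The original file name and the comment are stored the same way, so one helper can read both. This is a small sketch under the assumption that the end-of-string character is a single zero byte and that the sample file has no extra field, so the name starts at offset 10; read_latin1_string is an illustrative name.

```python
import gzip
import io

def read_latin1_string(buf: bytes, offset: int):
    """Return (decoded string, offset just past the terminating zero byte)."""
    end = buf.index(b"\x00", offset)  # raises ValueError if no terminator is found
    return buf[offset:end].decode("latin-1"), end + 1

# Create a gzip stream in memory that stores an original file name
stream = io.BytesIO()
with gzip.GzipFile(filename="report.txt", mode="wb", fileobj=stream, mtime=0) as gz:
    gz.write(b"payload")
raw = stream.getvalue()

if raw[3] & 0x08:  # FNAME flag set
    name, next_offset = read_latin1_string(raw, 10)
    print(name, next_offset)
```

If the FCOMMENT flag were also set, calling the helper again at next_offset would return the comment.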

Header checksum

The header checksum contains a CRC-16 that consists of the two least significant bytes of the CRC-32 of all bytes of the gzip header up to but not including the CRC-16.
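
The relation between the CRC-16 and the CRC-32 can be checked with Python's zlib module. The sketch below is an illustration only: header_crc16 and check_header_crc are hypothetical helpers, and the sample header is built by hand because common tools rarely set FHCRC.

```python
import struct
import zlib

def header_crc16(header_bytes: bytes) -> int:
    """Two least significant bytes of the CRC-32 of the header bytes."""
    return zlib.crc32(header_bytes) & 0xFFFF

def check_header_crc(buf: bytes, crc_offset: int) -> bool:
    """Compare the stored little-endian CRC-16 with a freshly computed one."""
    (stored,) = struct.unpack_from("<H", buf, crc_offset)
    return stored == header_crc16(buf[:crc_offset])

# Hand-built header with flags = 0x02 (FHCRC) and no other optional parts
hdr = bytes([0x1F, 0x8B, 8, 0x02, 0, 0, 0, 0, 0, 255])
buf = hdr + struct.pack("<H", header_crc16(hdr))
print(check_header_crc(buf, len(hdr)))
```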

File footer

The file footer is 8 bytes in size and contains:

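{| class="wikitable"
! Offset !! Size !! Value !! Description
|-
| 0 || 4 || || Checksum (CRC-32) of the uncompressed data
|-
| 4 || 4 || || Uncompressed data size <br> Value in bytes, modulo 2^32.
|}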
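
The footer allows a quick integrity check of recovered data. The following sketch assumes a file with a single gzip member, so that the footer is the last 8 bytes, and that the CRC-32 and size both refer to the uncompressed data (per RFC 1952); parse_footer is an illustrative name.

```python
import gzip
import struct
import zlib

def parse_footer(data: bytes):
    """Return (CRC-32, uncompressed size) from the last 8 bytes."""
    crc32, isize = struct.unpack("<II", data[-8:])
    return crc32, isize

payload = b"forensics" * 100
blob = gzip.compress(payload)
crc32, isize = parse_footer(blob)

# Both values describe the uncompressed data, not the compressed body
print(crc32 == zlib.crc32(payload), isize == len(payload) % 2**32)
```

Since the size is stored in 4 bytes, it wraps around for inputs of 4 GiB or more and should not be trusted as an exact size for very large files.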

See Also

External Links