Difference between revisions of "Word Document (DOC)"

From Forensics Wiki
(File Header)
 
(11 intermediate revisions by one user not shown)
Line 1: Line 1:
The '''DOC file format''' ('''document file format''') usually has the '''.doc''' extension. Mostly these documents belong to [[Microsoft]] [[Word]] software files. However, other text editing software can be used to display these files (including [[WordPad]], [[WordPerfect]], [[OpenOffice]] and others).
+
The '''Word Document (DOC) file format''' has the '''.doc''' extension. This file type originates from [[Microsoft Word]]. However, other word processing software can be used to display these files as well. These include:
 +
* [[WordPad]]
 +
* [[WordPerfect]]
 +
* [[OpenOffice]]
 +
* [[AbiWord]]
  
The DOC file format should not be confused with [[DOCX]].
+
The Word DOC file format should not be confused with [[DOCX]].
  
 
== MIME types ==
 
== MIME types ==
Line 18: Line 22:
 
* zz-application/zz-winassoc-doc
 
* zz-application/zz-winassoc-doc
  
== File Header ==
+
== File signature ==
  
MS Word documents of version 97 (and probably earlier) begin with the file signature (in hexadecimal) d0cf11e0a1b11ae1 .
+
[[Microsoft Word]] documents of version 97-2003 use the [[OLE Compound File]] (OLECF). These files therefore have the OLECF file signature
This signature signifies the file to be an OLE Compound File (AKA Compound Document File or Compound Binary File)
+
  
The OLE Compound File has no distinct footer and a can be considered a file containing a FAT like file system.
+
The object stream of the OLECF containing a Word document contains the string "Word.Document" with some version.
  
The Word document format is places on top of the OLE Compound File,.
+
== Word 97-2003 documents ==
  
The object stream of a word documents contains the string "Word.Document" with some version.
+
The Word Binary File format is stored in the OLECF using multiple streams:
 +
* WordDocument stream
 +
* Table stream (0Table, 1Table)
 +
* Data stream
  
 
== Encryption ==
 
== Encryption ==
  
 
Versions 97/2000 encrypt documents with a very weak algorithm. This password scheme can be broken easily by several different products and it is possible to decrypt the contents without discovering the password. This is done by testing all 1,099,511,627,776 possible keys. Ultimate Zip Cracker by VDGSoftware is one utility that can perform this decryption.
 
Versions 97/2000 encrypt documents with a very weak algorithm. This password scheme can be broken easily by several different products and it is possible to decrypt the contents without discovering the password. This is done by testing all 1,099,511,627,776 possible keys. Ultimate Zip Cracker by VDGSoftware is one utility that can perform this decryption.
== See Also==
 
[[Media:Compdocfileformat.pdf|Microsoft Compound Document File Format]]
 
  
 
== Extracting Strings ==
 
== Extracting Strings ==
Line 44: Line 48:
  
 
(where /tmp/test.doc is the path to your .doc file)
 
(where /tmp/test.doc is the path to your .doc file)
 +
 +
Note that a Word 97 and later document can contain both extended ASCII with codepage 1252 (codepage 1252 compressed text) and UTF-16 little-endian text. Word document can also contain 'East Asian' or 'Complex script' languages. Also the text stream contains information about all the parts of the Word document (header/footer, foot/endnote, annotation, etc.)
 +
Therefore using basic Unix string is very rough approach of finding data in a Word document. Use the wvtools or more sophisticated tools instead.
 +
 +
== External Links ==
 +
* [http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/Word97-2007BinaryFileFormat(doc)Specification.pdf Word 97-2007 Binary File Format by Microsoft]
 +
  
 
[[Category:File Formats]]
 
[[Category:File Formats]]

Latest revision as of 00:17, 11 August 2012

The Word Document (DOC) file format has the .doc extension. This file type originates from Microsoft Word. However, other word processing software can display these files as well, including:

  • WordPad
  • WordPerfect
  • OpenOffice
  • AbiWord

The Word DOC file format should not be confused with DOCX.

MIME types

The following MIME types apply to this file format:

  • application/msword
  • application/doc
  • appl/text
  • application/vnd.msword
  • application/vnd.ms-word
  • application/winword
  • application/word
  • application/x-msw6
  • application/x-msword
  • zz-application/zz-winassoc-doc

File signature

Microsoft Word documents of versions 97-2003 use the OLE Compound File (OLECF) format. These files therefore begin with the OLECF file signature, which is d0cf11e0a1b11ae1 in hexadecimal.

In an OLECF that holds a Word document, the object stream contains the string "Word.Document" followed by a version identifier.
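
The two checks described in this section can be sketched in Python. This is a minimal sketch, not a full OLECF parser: it compares the first eight bytes against the OLECF signature d0cf11e0a1b11ae1 and then searches the raw bytes for "Word.Document". The function names are hypothetical, and the assumption that the string is stored as single-byte text is mine; a raw search can also match unrelated data.

```python
# OLECF signature: the first 8 bytes of an OLE Compound File.
OLECF_SIGNATURE = bytes.fromhex("d0cf11e0a1b11ae1")


def has_olecf_signature(data: bytes) -> bool:
    """Return True if the data begins with the OLECF signature."""
    return data[:len(OLECF_SIGNATURE)] == OLECF_SIGNATURE


def mentions_word_document(data: bytes) -> bool:
    """Return True if the raw bytes contain "Word.Document".

    Assumption: the string is stored as single-byte (ASCII) text.
    This is a raw search, not a parse of the object stream.
    """
    return b"Word.Document" in data


def looks_like_word_doc(data: bytes) -> bool:
    """Combine both checks; a hint, not proof of a valid document."""
    return has_olecf_signature(data) and mentions_word_document(data)
```

To check a file on disk, read it with open(path, "rb").read() and pass the bytes to looks_like_word_doc. A positive result only suggests a Word 97-2003 document; confirming it requires parsing the compound file itself.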

Word 97-2003 documents

The Word Binary File format is stored in the OLECF using multiple streams:

  • WordDocument stream
  • Table stream (0Table, 1Table)
  • Data stream
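
As a rough illustration of the streams listed above, the following Python sketch looks for their names in the raw file bytes. It relies on the fact that OLECF directory entries store names as null-terminated UTF-16 little-endian text. It is a heuristic under that assumption only: it does not walk the directory, so a short name such as "Data" can also match unrelated text. The function name is hypothetical.

```python
# Stream names from the list above.
STREAM_NAMES = ("WordDocument", "0Table", "1Table", "Data")


def stream_name_hints(data: bytes) -> dict:
    """Map each stream name to True if it appears in the raw bytes.

    Names are searched as null-terminated UTF-16LE, the encoding used
    by OLECF directory entries. A hit is a hint, not proof that the
    stream exists, because the directory is not actually parsed.
    """
    return {
        name: (name + "\0").encode("utf-16-le") in data
        for name in STREAM_NAMES
    }
```

A reliable listing of streams needs a real compound file parser; this sketch is only useful for quick triage of a file or of carved data.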

Encryption

Word versions 97 and 2000 encrypt documents with a very weak algorithm. Several products can break this password scheme easily, and the contents can be decrypted without discovering the password by testing all 1,099,511,627,776 (2^40) possible keys. Ultimate Zip Cracker by VDGSoftware is one utility that can perform this decryption.
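
The key count quoted above is exactly 2^40, the size of a 40-bit key space. The short Python sketch below reproduces that figure and estimates how long an exhaustive search would take. The function name and the keys-per-second rate are hypothetical inputs chosen for illustration, not measurements of any tool.

```python
KEY_BITS = 40
KEYSPACE = 2 ** KEY_BITS  # 1,099,511,627,776 possible keys

SECONDS_PER_DAY = 86_400


def exhaustive_search_days(keys_per_second: float) -> float:
    """Days needed to test every key at the given (assumed) rate."""
    return KEYSPACE / keys_per_second / SECONDS_PER_DAY
```

For example, at an assumed rate of one million keys per second, exhaustive_search_days(1_000_000) comes to roughly 12.7 days, and on average the correct key is found after searching about half the key space.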

Extracting Strings

On a Unix-like machine, try this command to extract strings from a .doc file:

tr -d '\0' < /tmp/test.doc | strings | more

(where /tmp/test.doc is the path to your .doc file)

Note that a Word 97 or later document can contain both extended ASCII text in codepage 1252 (codepage 1252 compressed text) and UTF-16 little-endian text. Word documents can also contain 'East Asian' or 'Complex script' languages. In addition, the text stream contains information about all the parts of the Word document (header/footer, footnote/endnote, annotation, etc.). Therefore, running the basic Unix strings utility is a very rough way of finding data in a Word document. Use wvtools or more sophisticated tools instead.
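
To illustrate the two encodings mentioned above, here is a minimal Python sketch that collects runs of single-byte text and runs of UTF-16 little-endian text separately, rather than deleting NUL bytes first. It is still a rough approach: it only recognises printable ASCII characters in either encoding, so extended codepage 1252 characters, East Asian text and complex scripts are missed, and it knows nothing about document structure. The function name and the minimum run length of 4 are assumptions.

```python
import re

MIN_LEN = 4  # assumed minimum run length, similar to a typical strings default

# Runs of printable ASCII stored one byte per character.
ASCII_RUN = re.compile(rb"[\x20-\x7e]{%d,}" % MIN_LEN)
# Runs of printable ASCII stored as UTF-16LE (each character followed by 0x00).
UTF16_RUN = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % MIN_LEN)


def extract_strings(data: bytes) -> list:
    """Return single-byte runs first, then UTF-16LE runs, as text."""
    found = [m.group().decode("ascii") for m in ASCII_RUN.finditer(data)]
    found += [m.group().decode("utf-16-le") for m in UTF16_RUN.finditer(data)]
    return found
```

Pass the function the bytes of a file, for example open("/tmp/test.doc", "rb").read(). Keeping the two encodings apart avoids gluing unrelated bytes together, which can happen when NUL bytes are simply removed.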

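External Links

  • Word 97-2007 Binary File Format by Microsoft: http://download.microsoft.com/download/0/B/E/0BE8BDD7-E5E8-422A-ABFD-4342ED7AD886/Word97-2007BinaryFileFormat(doc)Specification.pdf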