Difference between revisions of "Word Document (DOCX)"

From ForensicsWiki
Line 1: Line 1:
DOCX is the file format for Microsoft Office 2007 and later.
+
DOCX is the file format for Microsoft Office 2007 and later.  
  
 
DOCX should not be confused with [[DOC]], the format used by earlier versions of Microsoft Office.
 
Line 5: Line 5:
 
= Container Format =
 
  
DOCX consists of a [[ZIP archive]] file containing [[XML]] and binaries. Content can be analysed without modification by unzipping the file (e.g. in WinZIP) and analysing the contents of the archive.
+
DOCX is written in an OpenXML format, which consists of a [[ZIP archive]] file containing [[XML]] and binaries. Content can be analysed without modification by unzipping the file (e.g. in WinZIP) and analysing the contents of the archive.
 +
 
 +
The file _rels/.rels contains information about the structure of the document.  It contains paths to the metadata information as well as the main XML document that contains the content of the document itself.
 +
 
 +
Metadata information are usually stored in the folder docProps.  Two or more XML files are stored inside that folder, app.xml that stores metadata information extracted from the Word application itself and core.xml that stores metadata from the document itself, such as the author name, last time it was printed, etc.
 +
 
 +
Another folder contains the actual content of the document, in a Word document, or an .docx document the folder's name is word.  A XML file called document.xml is the main document, containing most of the content of the document itself.
  
 
= Relationship to OOXML =
 
Line 14: Line 20:
  
 
= External Links =
 
 +
 +
* [http://msdn.microsoft.com/en-us/library/aa338205.aspx Information from Microsoft about the structure of OpenXML documents]
  
 
* [http://www.simson.net/clips/academic/2009.IEEE.DOCX.pdf The new XML Office Document Files: Implications For Forensics], [[Simson L. Garfinkel]] and James Migletz
 
  
 +
* [http://blog.kiddaland.net/2009/07/antiword-for-office-2007/ Perl script that displays the content of a Docx document, similar to Antiword]
 +
 +
* [http://blog.kiddaland.net/2009/06/office-2007-metadata/ Perl script that displays metadata information that is extracted from an OpenXML document]
 
[[Category:File Formats]]
 

Revision as of 16:57, 27 August 2009

DOCX is the default document format of Microsoft Word in Microsoft Office 2007 and later.

DOCX should not be confused with DOC, the format used by earlier versions of Microsoft Office.

Container Format

DOCX is written in the Office Open XML (OpenXML) format, which consists of a ZIP archive file containing XML and binaries. The content can be analysed without modifying it by unzipping the file (e.g. in WinZIP) and examining the contents of the archive.
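As an alternative to unzipping by hand, the container can be inspected in place. The following is a minimal Python sketch using only the standard library; the function name `list_docx_members` is hypothetical and chosen for illustration. It opens the archive read-only and extracts nothing to disk.

```python
import zipfile


def list_docx_members(path_or_file):
    """Return (name, uncompressed size, CRC-32) for every member of the ZIP container.

    The archive is opened in read mode only, so the evidence file is not altered.
    """
    with zipfile.ZipFile(path_or_file) as archive:
        return [
            (info.filename, info.file_size, info.CRC)
            for info in archive.infolist()
        ]
```

Calling `list_docx_members("suspect.docx")` (a placeholder file name) would typically show members such as `_rels/.rels` and the files under `docProps` and `word`.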

The file _rels/.rels describes the structure of the document. It contains the paths to the metadata files and to the main XML document that holds the content of the document itself.
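To see which parts _rels/.rels points at, the relationship entries can be read directly. This is a rough sketch, assuming the usual package-relationships namespace and `Type`/`Target` attributes; the function name `read_package_relationships` is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace assumed for relationship parts in Office Open XML packages.
RELS_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def read_package_relationships(path_or_file):
    """Return a list of (Type, Target) pairs declared in _rels/.rels."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("_rels/.rels"))
    return [
        (rel.get("Type"), rel.get("Target"))
        for rel in root.iter(RELS_NS + "Relationship")
    ]
```

The `Target` values are the paths inside the archive, so they are a safer guide to where the metadata and main document live than hard-coded folder names.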

Metadata is usually stored in the folder docProps, which holds two or more XML files: app.xml stores metadata extracted from the Word application itself, and core.xml stores metadata from the document itself, such as the author name and the last time it was printed.
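A minimal sketch for dumping these metadata files is shown below. It deliberately avoids hard-coding element names: it collects every non-empty top-level element of the chosen part and strips the XML namespace from its tag. The function name `read_properties` is hypothetical, and the member paths `docProps/core.xml` and `docProps/app.xml` are the usual locations rather than guaranteed ones.

```python
import zipfile
import xml.etree.ElementTree as ET


def read_properties(path_or_file, member):
    """Return {local element name: text} for the top-level elements of a metadata part.

    Raises KeyError if the requested member is not present in the archive.
    """
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read(member))
    props = {}
    for child in root:
        # Tags look like "{namespace}name"; keep only the local name.
        local_name = child.tag.rsplit("}", 1)[-1]
        if child.text and child.text.strip():
            props[local_name] = child.text.strip()
    return props
```

For example, `read_properties("suspect.docx", "docProps/core.xml")` (placeholder file name) would return whatever document-level properties are present, such as the author and last-printed time, when the file records them.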

Another folder contains the actual content of the document; in a Word (.docx) document this folder is named word. An XML file called document.xml is the main document, containing most of the content of the document itself.
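The text of the main document can be pulled out of document.xml with a few lines of Python. This is a sketch only, assuming the standard WordprocessingML namespace, where paragraphs are `w:p` elements and text runs are `w:t` elements; the function name `extract_paragraphs` is hypothetical. It ignores formatting, headers, footers and comments, and paragraphs nested inside other paragraphs (e.g. in text boxes) may appear more than once.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace assumed for WordprocessingML elements such as w:p and w:t.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_paragraphs(path_or_file, member="word/document.xml"):
    """Return the plain text of each paragraph in the main document part."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read(member))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Concatenate all text runs belonging to this paragraph.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs
```

The default `member` path matches the layout described above; if _rels/.rels names a different main document, that path can be passed instead.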

Relationship to OOXML

For most purposes, OOXML may be considered a subset of DOCX as written by Microsoft Word (DOCX files can contain additional features, such as OLE serialization).

Documentation on OOXML may provide a guide to analysing a DOCX file.

External Links