
From ForensicsWiki
The '''Portable Document Format''' ('''PDF''') is a document format from [[Adobe]] Inc. It is widely available on the web. Originally developed as a proprietary format, version 1.7 was released as an open standard in 2008. The standard is published as ISO/IEC 32000-1:2008. Although PDF is an open standard, Adobe still owns patents and copyrights related to it. Adobe has granted a worldwide royalty-free license to produce PDF software, but only if the software complies with the PDF standard.
== Format ==
It is a common misconception that PDF files are simply a collection of images, one per page.  Certainly a PDF can be formed that way (which is typical of document scanners), but in reality the document structure is much more complex.  A PDF file can contain text streams (which can be encoded and/or compressed in dozens of ways), vector and raster images, fonts, and various interactive elements.
A PDF file comprises sections called "objects." Each object is numbered and can represent a page, a font, a data stream, etc. Each file begins with the string <tt>%PDF</tt>. Each file ends with the string <tt>%%EOF</tt>, but there can be multiple <tt>%%EOF</tt> markers in a single file (this often confuses programs like [[foremost]] that search for footers).
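The header and footer markers described above can be located with a simple byte search. The following is a minimal sketch, not a parser: the function name <tt>describe_pdf_markers</tt> is hypothetical, and the code only reports where the <tt>%PDF</tt> and <tt>%%EOF</tt> strings occur without validating the file.

```python
from pathlib import Path


def describe_pdf_markers(data: bytes) -> dict:
    """Report the offset of the %PDF header and of every %%EOF marker."""
    header_offset = data.find(b"%PDF")  # -1 if the header string is absent
    eof_offsets = []
    pos = data.find(b"%%EOF")
    while pos != -1:
        eof_offsets.append(pos)
        pos = data.find(b"%%EOF", pos + 1)
    return {"header_offset": header_offset, "eof_offsets": eof_offsets}


if __name__ == "__main__":
    import sys

    # Usage: python script.py file1.pdf file2.pdf ...
    for name in sys.argv[1:]:
        print(name, describe_pdf_markers(Path(name).read_bytes()))
```

A file with more than one entry in <tt>eof_offsets</tt> is a case where a carver that stops at the first footer would truncate the file.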
Adobe's Acrobat software supports "incremental updates."  The standard allows modifications to be appended to the file, leaving the original data intact: any new or altered object is written after the end of the original file.  Deleted objects remain in the file and are only marked as deleted. This can potentially cause inadvertent disclosure of sensitive information.
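Because each incremental update is appended after the previous <tt>%%EOF</tt>, truncating the file at each marker yields candidate earlier revisions. The sketch below illustrates the idea only; <tt>split_revisions</tt> is a hypothetical helper, and a <tt>%%EOF</tt> marker can appear for other reasons, so each slice should be treated as a candidate to examine rather than as a guaranteed valid revision.

```python
def split_revisions(data: bytes) -> list[bytes]:
    """Return one byte string per %%EOF marker, each ending at that marker."""
    marker = b"%%EOF"
    revisions = []
    pos = data.find(marker)
    while pos != -1:
        end = pos + len(marker)
        # Keep the line ending that follows the marker, if there is one.
        if data[end:end + 2] == b"\r\n":
            end += 2
        elif data[end:end + 1] in (b"\r", b"\n"):
            end += 1
        revisions.append(data[:end])
        pos = data.find(marker, end)
    return revisions
```

Each returned slice starts at the beginning of the file, so the last slice normally equals the whole file and the earlier ones omit whatever was appended later.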
== Metadata ==
PDF metadata can be stored in a document information dictionary or as a metadata stream, sometimes both. A metadata stream can describe the entire document or an individual component of a document. Thus, multiple metadata streams may exist in a single document, making it difficult to find all of them. Metadata streams are stored in Adobe's XML-based XMP (Extensible Metadata Platform) format. Even if a PDF document is encrypted, the accompanying metadata is not required to be, and often is not, encrypted.
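Since a document may hold several metadata streams, one rough way to enumerate them is to scan the raw bytes for XMP packet wrappers (<tt>&lt;?xpacket begin=...?&gt;</tt> through <tt>&lt;?xpacket end=...?&gt;</tt>). This is a sketch under a strong assumption: it only finds packets that are stored uncompressed and unencrypted, and it will miss any stream that has been filtered. The name <tt>find_xmp_packets</tt> is hypothetical.

```python
import re

# An XMP packet is wrapped in "<?xpacket begin=...?>" and "<?xpacket end=...?>".
XMP_PACKET = re.compile(
    rb"<\?xpacket begin=.*?\?>.*?<\?xpacket end=.*?\?>",
    re.DOTALL,
)


def find_xmp_packets(data: bytes) -> list[bytes]:
    """Return every plainly stored XMP packet found in the raw bytes."""
    return XMP_PACKET.findall(data)
```

Each returned packet is XML and can be handed to an XML parser for further inspection.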
The metadata (or parts of it) can be extracted with [[pdfinfo]], a utility which is part of the [[xpdf]] package.
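As an illustration, <tt>pdfinfo</tt> can be driven from a script and its "Key: value" output turned into a dictionary. This sketch assumes <tt>pdfinfo</tt> is installed and on the PATH; the helper names <tt>parse_pdfinfo</tt> and <tt>run_pdfinfo</tt> are hypothetical, and the exact set of fields printed depends on the document and the tool version.

```python
import subprocess


def parse_pdfinfo(text: str) -> dict[str, str]:
    """Turn pdfinfo's 'Key: value' lines into a dictionary."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as times stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def run_pdfinfo(path: str) -> dict[str, str]:
    """Run pdfinfo on one file and return its parsed output."""
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    return parse_pdfinfo(result.stdout)
```

For example, <tt>run_pdfinfo("report.pdf")</tt> (a made-up file name) returns a dictionary of whatever fields <tt>pdfinfo</tt> printed for that file.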
== Embedded Objects ==
The PDF standard supports embedding many types of files such as images. Embedded files may contain their own metadata. You can use [[pdfimages]], part of the [[xpdf]] package, to extract all of the images from a PDF file and write each one to its own file.
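A small wrapper around <tt>pdfimages</tt> might look like the following. It is a sketch that assumes <tt>pdfimages</tt> is installed and on the PATH; the helper names, the output directory, and the <tt>img</tt> prefix are arbitrary choices for illustration. The <tt>-j</tt> option asks the tool to write JPEG-encoded images as JPEG files rather than converting them.

```python
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path: str, out_dir: str, prefix: str = "img") -> list[str]:
    """Build the command line: pdfimages -j <pdf> <image-root>."""
    return ["pdfimages", "-j", pdf_path, str(Path(out_dir) / prefix)]


def extract_images(pdf_path: str, out_dir: str, prefix: str = "img") -> list[Path]:
    """Run pdfimages and return the files it wrote into out_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfimages_command(pdf_path, out_dir, prefix), check=True)
    # Output files are named from the image root followed by a number.
    return sorted(Path(out_dir).glob(prefix + "-*"))
```

The extracted image files can then be examined for their own metadata with the usual image tools.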
== Subformats ==
Several related standards exist that contain subsets or supersets of the PDF standard features. These standards include:
* PDF/A, a simpler set of features for archiving documents, allowing for long-term reproducibility. Some scanning software saves documents in PDF/A by default.
* PDF/X for graphic arts.
* PDF/UA for universal accessibility.
* PDF/E for engineering drawings.
== PDF Software ==
Due to the popularity of the PDF format, there is much software available for viewing and creating PDF documents. However, Adobe maintains a de facto monopoly on software capable of editing PDF documents.  There are quite a few tools that merge or split PDF documents, but few that can make meaningful edits.  Software such as OpenOffice.org and Inkscape can import PDF files into their native formats, where the documents can be edited and then exported back to PDF. Unfortunately, this option can be quite cumbersome.
=== PDF Tools ===
These tools are useful for analyzing PDF files:
; Origami
: http://security-labs.org/origami/
: A powerful open source framework and GUI written in Ruby. It allows for parsing and exploring PDF files and graphically browsing their contents.
; PDF Tools
: http://blog.didierstevens.com/programs/pdf-tools/
: Didier Stevens' [http://blog.didierstevens.com/2008/10/30/pdf-parserpy/ pdf-parser] and pdfid, written in Python.
; pdfresurrect
: http://www.757labs.com/projects/pdfresurrect/#downloads
: Retrieves previous versions of PDF files that have changes appended with "incremental updates"
; PDFMiner
: http://www.unixuser.org/~euske/python/pdfminer/index.html
: "Python PDF parser and analyzer"
: Includes '''pdf2txt.py''' command-line tool for extracting text from PDF files, and '''dumppdf.py''' for dumping PDF objects.
; pyPdf
: http://pybrary.net/pyPdf/
: "A Pure-Python library built as a PDF toolkit."
: Will encrypt and decrypt PDF files.
; qpdf
: http://sourceforge.net/projects/qpdf/
: Open source, cross-platform library and set of programs to inspect and manipulate PDF files. Packaged in recent Debian-based distributions.
These tools are useful for manipulating and generating PDF files:
; ReportLab Open Source PDF Library
: http://www.reportlab.com/software/opensource/
: "our proven, industry-strength PDF generating software. Programmatically create any kind of PDF document"
== See Also ==
* [[Arabic PDFs]]
* [[Tools:Document Metadata Extraction]]
== External Links ==
* [http://partners.adobe.com/public/developer/pdf/index_reference.html Adobe PDF Reference]
* [http://en.wikipedia.org/wiki/PDF Wikipedia: PDF]
* [http://www.mactech.com/articles/mactech/Vol.15/15.09/PDFIntro/ Portable Document Format: An Introduction for Programmers], MacTech Magazine, Volume 15, (1999), Issue 9
* [http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=51502 ISO Standard]
* [http://partners.adobe.com/public/developer/support/topic_legal_notices.html Patent Licenses]
* [http://blog.didierstevens.com/2008/04/09/quickpost-about-the-physical-and-logical-structure-of-pdf-files/ Quickpost: About the Physical and Logical Structure of PDF Files], by Didier Stevens, April 9, 2008
[[Category:File Formats]]

Latest revision as of 17:31, 30 October 2013