Difference between pages "Arabic PDFs" and "Open Document Format"

From ForensicsWiki
(Difference between pages)
m
 
(Created ODF page, more to come.)
 
Line 1: Line 1:
This page discusses issues that arise when working with Adobe PDF files that contain Arabic.  
+
'''Open Document Format''' (ODF) is an open, XML-based file format standard for word processing documents, spreadsheets, charts, and presentations. The specification was originally developed by Sun Microsystems, but has been standardized by the Organization for the Advancement of Structured Information Standards (OASIS). ODF version 1.0 has been standardized as ISO/IEC 26300:2006. ODF is the primary format for the OpenOffice.org office suite.
  
==Glyphs vs. Characters==
+
=File Extensions=
The term ''character'' describes the abstract concept of a letter. The term ''glyph'' describes how a character is printed. A single character can have multiple glyphs (for example, glyphs with serifs and those without). A single glyph can represent multiple characters (for example, a lower case ''l'' and a capital ''I'' in Helvetica).
+
The main file extensions for ODF documents are
 +
* .odt for word processing documents
 +
* .ods for spreadsheet documents
 +
* .odp for presentation documents
 +
* .odb for database documents
 +
* .odg for graphical documents
 +
* .odf for mathematical formulae
  
PDFs can contain glyphs or characters.
+
ODF also supports template files for each type of document.  For template files, the 'd' in the file extension is replaced by a 't' (for example, .ott instead of .odt).
  
Unicode is set up as a catalogue of nominal characters, independent and irrespective of the (computer) typographical consequences.
+
=File Structure=
 +
An ODF document can be as simple as a single XML file.  However, this is rarely practical. The standard specifies that an ODF file can also be stored as a collection of several subdocuments.  The latter is the most common implementation.
  
Modern PDFs essentially describe the result of a 19th century-style metal-based typesetting process. Ideally PDFs should encode characters, not glyphs. But when typesetting Arabic, Unicode is used as a glyph list, rather than a character list. The glyphs are used as indexes into a huge font book.
+
[Category:File Formats]
 
+
By interpreting the Unicode standard as a look-up for glyph indexes, Unicode is abused as if it were a huge font book. This confuses multi-lingual encoding with computer typography.
+
 
+
An underlying cause for this error is the idea that there can be such a thing as a Character-Glyph model. However, in the real world there is no connection between abstract characters and the glyphs used to represent them.
+
 
+
Increasingly, font designers are discovering the enormous conceptual freedom one gets without any Character-Glyph constraint. But Adobe still uses the Unicode standard to extract the nominal character values from the font glyph numbers used to represent them. That is why more advanced Arabic fonts that do not use the Unicode Presentation Blocks produce gibberish when text is extracted from the PDF.
+
 
+
Future versions of PDF are planned to embed Unicode as text in addition to the font information, which would resolve this issue.
+
 
+
Part of the problem is that Unicode's Arabic Presentation Blocks are officially deprecated by the Unicode Consortium. Their inclusion was, at the time (the late 1980s), a technical compromise to allow ISO 10646 to join Unicode. The compromise was incomplete, as only 400 of the originally requested 4000 Arabic ligatures were allowed to remain in the Unicode Standard. Ironically, all the printed examples in the Unicode standard were designed by Thomas Milo based on computer-generated synthesis of the underlying letter block fusions of traditional Arabic "Script Grammar". This was done using DecoType's famous ACE technology, which eventually became the working model for Microsoft's TrueType Open, the precursor of today's OpenType.
+
+
Arabic Presentation Forms should never be encoded; such a practice amounts to reverting to Font Pages, whose very proliferation caused the development of a more intelligent alternative: Unicode.
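
As a rough illustration of the difference between presentation forms and nominal characters, the sketch below checks whether extracted text contains code points from the Arabic Presentation Forms blocks and folds them back to nominal characters with Unicode compatibility normalization (NFKC). This is only an assumption about how an examiner might post-process extracted text; it is not how any PDF tool is known to work, and the function names are hypothetical. It does not help with fonts that bypass the presentation blocks entirely.

```python
import unicodedata


def uses_presentation_forms(text: str) -> bool:
    """Return True if text contains Arabic Presentation Forms code points.

    Ranges are approximate: Forms-A is U+FB50..U+FDFF and Forms-B is
    U+FE70..U+FEFC (U+FEFF, the byte order mark, is deliberately excluded).
    """
    return any(
        0xFB50 <= ord(ch) <= 0xFDFF or 0xFE70 <= ord(ch) <= 0xFEFC
        for ch in text
    )


def fold_presentation_forms(text: str) -> str:
    """Map presentation forms to nominal characters using NFKC.

    Note that NFKC also folds other compatibility characters, so it
    changes more than just Arabic text.
    """
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    ligature = "\uFEFB"  # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
    folded = fold_presentation_forms(ligature)
    print(uses_presentation_forms(ligature))        # True
    print([f"U+{ord(ch):04X}" for ch in folded])    # ['U+0644', 'U+0627']
```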
+
 
+
==References==
+
*http://www.river-valley.tv/conferences/arabic_typography_2008/
+
*http://www.river-valley.tv/conferences/non_latintypefacedesign/
+

Revision as of 17:45, 13 April 2010

Open Document Format (ODF) is an open, XML-based file format standard for word processing documents, spreadsheets, charts, and presentations. The specification was originally developed by Sun Microsystems, but has been standardized by the Organization for the Advancement of Structured Information Standards (OASIS). ODF version 1.0 has been standardized as ISO/IEC 26300:2006. ODF is the primary format for the OpenOffice.org office suite.

File Extensions

The main file extensions for ODF documents are

  • .odt for word processing documents
  • .ods for spreadsheet documents
  • .odp for presentation documents
  • .odb for database documents
  • .odg for graphical documents
  • .odf for mathematical formulae

ODF also supports template files for each type of document. For template files, the 'd' in the file extension is replaced by a 't' (for example, .ott instead of .odt).
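
The extension list and the template rule can be sketched as a small lookup. The names ODF_TYPES, template_extension, and classify are hypothetical, chosen only for this illustration. An extension is only a hint about a file's contents (.otf, for instance, is also used for OpenType fonts), so a result from this lookup should be confirmed by inspecting the file itself.

```python
from pathlib import Path

# Main ODF extensions, as listed above.
ODF_TYPES = {
    ".odt": "word processing document",
    ".ods": "spreadsheet document",
    ".odp": "presentation document",
    ".odb": "database document",
    ".odg": "graphical document",
    ".odf": "mathematical formula",
}


def template_extension(ext: str) -> str:
    """Replace the 'd' in an ODF extension with 't' (".odt" -> ".ott")."""
    return ext[:2] + "t" + ext[3:]


# Derived table of template extensions, e.g. ".ott" -> "word processing document".
ODF_TEMPLATES = {template_extension(ext): kind for ext, kind in ODF_TYPES.items()}


def classify(filename: str):
    """Return (document kind, is_template), or None if the extension is unknown."""
    ext = Path(filename).suffix.lower()
    if ext in ODF_TYPES:
        return ODF_TYPES[ext], False
    if ext in ODF_TEMPLATES:
        return ODF_TEMPLATES[ext], True
    return None


if __name__ == "__main__":
    print(classify("report.odt"))   # ('word processing document', False)
    print(classify("letter.ott"))   # ('word processing document', True)
    print(classify("scan.pdf"))     # None
```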

File Structure

An ODF document can be as simple as a single XML file. However, this is rarely practical. The standard specifies that an ODF file can also be stored as a collection of several subdocuments. The latter is the most common implementation.
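
The multi-file form can be explored with a short sketch. It assumes, beyond what this page states, that the subdocuments are stored together in a ZIP package containing a "mimetype" entry alongside members such as content.xml and META-INF/manifest.xml. The helper names are hypothetical, and the demo package holds placeholder XML stubs, not a valid document.

```python
import io
import zipfile


def describe_odf_package(data: bytes) -> dict:
    """List the members of a ZIP-packaged ODF file and read its mimetype entry."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()
        mimetype = None
        if "mimetype" in names:
            mimetype = zf.read("mimetype").decode("ascii").strip()
    return {"mimetype": mimetype, "members": names}


def make_demo_package() -> bytes:
    """Build a tiny in-memory package for demonstration only.

    The XML bodies are placeholders and do not form a valid ODF document.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # The mimetype entry is written first and left uncompressed.
        zf.writestr(
            "mimetype",
            "application/vnd.oasis.opendocument.text",
            compress_type=zipfile.ZIP_STORED,
        )
        zf.writestr("content.xml", "<placeholder/>",
                    compress_type=zipfile.ZIP_DEFLATED)
        zf.writestr("META-INF/manifest.xml", "<placeholder/>",
                    compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()


if __name__ == "__main__":
    info = describe_odf_package(make_demo_package())
    print(info["mimetype"])
    for name in info["members"]:
        print(" ", name)
```

To examine a real file, the bytes read from disk can be passed to describe_odf_package in place of the demo package; a file that is not a ZIP archive raises zipfile.BadZipFile, which may indicate the single-XML form or a different format altogether.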

[Category:File Formats]