Difference between pages "Frag find" and "Arabic PDFs"

From Forensics Wiki

==Frag find==

{{Infobox_Software |
  name = frag_find |
  maintainer = [[Simson Garfinkel]] |
  os = {{Linux}}, {{MacOS}}, {{FreeBSD}} |
  genre = [[Carving]] |
  license = {{Public Domain}} |
  website = http://www.afflib.org/
}}

frag_find is a program for finding blocks of a TARGET file in a disk IMAGE file. This is useful in cases where a TARGET file has been stolen and you wish to establish that the file has been present on a subject's drive. If most of the TARGET file's sectors are found on the IMAGE drive---and if the sectors are in consecutive sector runs---then the chances are excellent that the file was once there.

The idea of using individual sector hashes in this manner has been discussed in the forensic community for several years. Frag_find is an efficient and easy-to-use tool that performs this process.

frag_find relies on two observations about files and file systems:

# Most file systems tend to block-align files stored within the file system. So if you break up an 8K file into 16 different 512-byte blocks, then store that file in a file system, it is likely that each of those 16 "file blocks" will be stored in its own individual disk sector.
# Most 512-byte blocks within most files are "unique" --- that is, they do not appear by chance in other files. This is especially true for files that are compressed (like zip and docx files) and files that are encrypted. It is less true of files such as Microsoft Word doc files that are likely to have one or more blocks filled with NULLs or some other constant.

frag_find deals with the problem of non-unique blocks by looking for runs of matching blocks, rather than individual blocks.

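The overall matching process can be pictured with a short sketch. The following Python fragment illustrates only the general block-hash-and-run idea described above and is not frag_find's actual code: the choice of MD5, the data structures, and the handling of repeated TARGET blocks are all simplifying assumptions.

 import hashlib
 
 BLOCK_SIZE = 512   # frag_find's default block size
 
 def target_block_hashes(target_path, block_size=BLOCK_SIZE):
     """Hash every block-aligned block of the TARGET file."""
     hashes = {}
     with open(target_path, "rb") as f:
         block_no = 0
         while True:
             block = f.read(block_size)
             if not block:
                 break
             # NOTE: a real tool must cope with TARGET blocks that repeat;
             # this toy version simply keeps the last block number seen.
             hashes[hashlib.md5(block).digest()] = block_no
             block_no += 1
     return hashes
 
 def matching_runs(image_path, target_hashes, block_size=BLOCK_SIZE):
     """Scan the IMAGE sector by sector and collect runs of consecutive matches."""
     runs, current = [], None
     with open(image_path, "rb") as f:
         sector = 0
         while True:
             data = f.read(block_size)
             if not data:
                 break
             block_no = target_hashes.get(hashlib.md5(data).digest())
             if block_no is not None:
                 # Extend the current run only if both the IMAGE sector and
                 # the TARGET block continue the previous match consecutively.
                 if current and sector == current[1] + 1 and block_no == current[3] + 1:
                     current = (current[0], sector, current[2], block_no)
                 else:
                     if current:
                         runs.append(current)
                     current = (sector, sector, block_no, block_no)
             sector += 1
     if current:
         runs.append(current)
     return runs   # (first_sector, last_sector, first_target_block, last_target_block)

Long runs reported by a scan of this kind are what make the conclusion convincing; isolated single-block matches are much weaker evidence.
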
frag_find is fast because:

* Initial filtering of presence/absence is done using the NPS Bloom filter, an efficient memory-mapped Bloom filter implementation designed to be used with hash functions.
* Hashes are stored in efficient C++ structures.
* All computations are done in binary, rather than hex.

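Because the filter is keyed on hash values, which are already uniformly distributed, bit positions can be taken directly from slices of the digest instead of re-hashing. The toy Python class below only illustrates that pre-filtering idea; it is not the NPS implementation, and the filter size and number of bit positions are arbitrary assumptions.

 class SectorHashBloom:
     """Toy Bloom filter keyed on (already uniform) sector hash digests."""
 
     def __init__(self, size_bits=2**23, positions=4):
         self.size = size_bits
         self.positions = positions
         self.bits = bytearray(size_bits // 8)
 
     def _bit_positions(self, digest):
         # Take `positions` 4-byte slices of the digest; a 16-byte MD5 digest
         # is long enough for the default of 4 positions.
         for i in range(self.positions):
             chunk = digest[i * 4:(i + 1) * 4]
             yield int.from_bytes(chunk, "big") % self.size
 
     def add(self, digest):
         for p in self._bit_positions(digest):
             self.bits[p // 8] |= 1 << (p % 8)
 
     def maybe_contains(self, digest):
         # False means "definitely not a TARGET block"; True means
         # "possibly present", so fall through to the exact hash lookup.
         return all(self.bits[p // 8] & (1 << (p % 8))
                    for p in self._bit_positions(digest))

In a scan loop, a negative answer from such a filter rejects the common case (an IMAGE sector that does not belong to the TARGET file) without touching the full hash table.
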
==OPTIONS==

The following options are available:

  -b blocksize  - sets the blocksize (default is 512 bytes).
  -s <start>    - start the image scan at <start> (default is start of image)
  -e <end>      - stop the image scan at <end> (default is end of image)
  -r            - prints the raw association map, in addition to the cleaned one

==MEMORY USAGE==

frag_find uses 512MB of RAM for the Bloom filter, approximately 1MB of RAM for bookkeeping, and roughly 64 bytes for every block of the target file.

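To put the per-block figure in perspective (using the numbers above with the default 512-byte blocks): a 100 MB TARGET file breaks into roughly 200,000 blocks, so the per-block bookkeeping adds only about 12-13 MB on top of the fixed 512 MB Bloom filter.
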
==AVAILABILITY==

frag_find is part of the NPS Bloom package, which can be downloaded from http://www.afflib.org/.

The current version is:

    http://www.afflib.org/downloads/bloom-1.0.0.tar.gz

Just type ./configure && make && make install

==LICENSE==

The NPS Bloom Filter implementation is Public Domain.

==Arabic PDFs==

Revision as of 10:41, 7 March 2009

Modern PDFs essentially describe the result of a 19th century-style metal-based typesetting process. When typesetting Arabic, Unicode is used as a glyph list, rather than a character list. The glyphs are used as indexes into a huge font book.

By interpreting the Unicode standard as a look-up table for glyph indexes, Unicode is abused as if it were a huge font book. This confuses multi-lingual encoding with computer typography. Unicode is set up as a catalogue of nominal characters, independent of, and irrespective of, the (computer) typographical consequences.

An underlying cause for this error is the idea that there can be such a thing as a Character-Glyph model. However, in the real world there is no connection between abstract characters and the glyphs used to represent them.

Increasingly font designers are discovering the enormous conceptual freedom one gets without any Character-Glyph constraint. But Adobe still uses the Unicode standard to extract the nominal character values from the font glyph numbers used to represent them. That is why more advanced Arabic fonts that do not use the Unicode Presentation Blocks produce gibberish when text is extracted from the PDF.

Future versions of PDF are planned to embed Unicode as text in addition to the font information, which would resolve this issue.

Part of the problem is that Unicode's Arabic Presentation Blocks are officially deprecated by the Unicode Consortium. Their inclusion was, at the time (the late 1980s), a technical compromise to allow ISO 10646 to merge with Unicode. Even then the compromise was incomplete, as only 400 of the 4000 originally requested Arabic ligatures were allowed to remain in the Unicode Standard. Ironically, all the printed examples in the Unicode standard were designed by Thomas Milo, based on computer-generated synthesis of the underlying letter block fusions of traditional Arabic "Script Grammar". This was done using DecoType's famous ACE technology, which eventually became the working model for Microsoft's TrueType Open, the precursor of today's OpenType.

Arabic Presentation Forms should never be encoded; such a practice amounts to reverting to Font Pages, whose very proliferation caused the development of a more intelligent alternative: Unicode.
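
The difference between nominal characters and presentation-form glyph codes can be seen from a Python prompt. The snippet below only illustrates the encoding distinction discussed above; it relies on the standard unicodedata module, and the particular code points are just one example pair.

 import unicodedata
 
 # Nominal characters, the way Unicode intends Arabic text to be stored:
 # LAM (U+0644) followed by ALEF (U+0627).
 nominal = "\u0644\u0627"
 
 # The same pair encoded as a deprecated presentation form:
 # U+FEFB ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM, a glyph code rather
 # than a character.
 presentation = "\uFEFB"
 
 print([unicodedata.name(c) for c in nominal])
 print(unicodedata.name(presentation))
 
 # The two strings are different code point sequences, so text extracted as
 # presentation forms will not match a search for the nominal characters.
 print(nominal == presentation)                                  # False
 
 # Compatibility normalization (NFKC) folds the presentation form back to
 # the nominal characters, one after-the-fact repair for extracted text.
 print(unicodedata.normalize("NFKC", presentation) == nominal)   # True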

==References==

*http://www.river-valley.tv/conferences/arabic_typography_2008/
*http://www.river-valley.tv/conferences/non_latintypefacedesign/