Approximate Matching

From Forensics Wiki
Revision as of 10:11, 24 May 2013 by Simsong (Talk | contribs)

Similarity is a term used in computer forensics to mean that two objects have similar contents but are not identical.

The following two paragraphs are clearly similar but not identical:

We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the common defence, promote the general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America.
We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the common defense, promote the general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America.

In forensics there are several kinds of similarity that are of interest:

  1. Binary Similarity
  2. Textual Similarity
  3. Visual Similarity
  4. Audible Similarity
  5. Algorithmic (code) Similarity

Binary Similarity

Binary Similarity between a master object and a target object can be rigorously defined as the number of substrings that the two objects have in common divided by the total number of substrings in the master object. Because the denominator depends only on the master object, the similarity function is not commutative. That is, BS(a,b) may not equal BS(b,a).
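
The definition above can be sketched in a few lines of Python. This is only an illustration, not the method used by any of the tools named on this page. It assumes that "substrings" means the distinct fixed-length byte sequences of an object; the length n = 4 and the names substrings and binary_similarity are hypothetical choices made for this example.

```python
def substrings(data: bytes, n: int) -> set:
    """Return the set of distinct length-n substrings of data (assumed unit of comparison)."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def binary_similarity(master: bytes, target: bytes, n: int = 4) -> float:
    """Fraction of the master's distinct length-n substrings that also occur in target."""
    master_grams = substrings(master, n)
    if not master_grams:
        # Master is shorter than n bytes, so it has no substrings to compare.
        return 0.0
    target_grams = substrings(target, n)
    return len(master_grams & target_grams) / len(master_grams)


master = b"provide for the common defence"
target = b"HEADER " + master + b" TRAILER"

# The master is embedded in the target, so every master substring is found.
print(binary_similarity(master, target))
# Swapping the arguments gives a lower score: the target has substrings the master lacks.
print(binary_similarity(target, master))
```

Swapping the arguments changes the denominator, which is why the two calls at the end return different scores.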

There are several applications for a binary similarity function:

  1. Determining that a master object is embedded in the target object.
  2. Determining if the target object is derived from the master object.

The leading binary similarity systems in use are:

  • sdhash, developed by Vassil Roussev.
  • ssdeep, the first widely used binary similarity algorithm. Developed by Jesse Kornblum, this system uses a piecewise hash comparison algorithm originally developed for anti-spam systems.

Textual Similarity

The leading text similarity system is:

  • sdtext, developed by Clay Shields.