Difference between revisions of "File Format Identification"

From ForensicsWiki
 
(12 intermediate revisions by 3 users not shown)
Line 1: Line 1:
 
File Format Identification is the process of figuring out the format of a sequence of bytes. Operating systems typically do this by file extension or by embedded MIME information. Forensic applications need to identify file types by content.
 
File Format Identification is the process of figuring out the format of a sequence of bytes. Operating systems typically do this by file extension or by embedded MIME information. Forensic applications need to identify file types by content.
 +
  
 
=Tools=
 
=Tools=
Line 8: Line 9:
 
* http://sourceforge.net/projects/libmagic
 
* http://sourceforge.net/projects/libmagic
  
==DROID==
+
==Digital Preservation Efforts==
* Writen in Java
+
PRONOM is  a project of the National Archives of the United Kingdom to develop a registry of file types. A similar project was started by JSTOR and Harvard as the JSTOR/Harvard Object Validation Environment. Attempts are now underway to merge these two efforts in the Global Digital Format Registry and the Universal Digital Format Registry.
* Developed by National Archives of the United Kingdom.
+
 
* http://droid.sourceforge.net
+
The UK National Archives developed the Digital Record Object Identification (DROID) tool, an "automatic file format identification tool." This tool is written in Java and can be downloaded from SourgeForge.
 +
 
 +
See:
 +
* [http://www.nationalarchives.gov.uk/PRONOM/Default.aspx  PRONOM]
 +
* [http://hul.harvard.edu/jhove/ JHOVE]
 +
* [https://wiki.ucop.edu/display/JHOVE2Info/Home JHOVE2]
 +
* [http://www.gdfr.info/  GDFR]
 +
* [http://www.udfr.org/  UDFR]
 +
* [http://droid.sourceforge.net DROID download]
  
 
==TrID==
 
==TrID==
Line 32: Line 41:
 
* Proprietary.
 
* Proprietary.
 
* Provides detection of password protected archives, some files of cryptographic programs, Pinch/Zeus binary reports, etc.
 
* Provides detection of password protected archives, some files of cryptographic programs, Pinch/Zeus binary reports, etc.
 
+
* http://nhtcu.ru/0xFA_eng.html
 
[[Category:Tools]]
 
[[Category:Tools]]
 +
 +
=Data Sets=
 +
If you are working in the field of file format identification, please consider reporting the results of your algorithm with one of these publicly available data sets:
 +
* NPS govdocs1m - a corpus of 1 million files that can be redistributed without concern of copyright or PII. Download from http://domex.nps.edu/corp/files/govdocs1/
 +
* The NPS Disk Corpus - a corpus of realistic disk images that contain no PII. Information is at: http://digitalcorpora.org/?s=nps
  
 
=Bibliography=
 
=Bibliography=
 +
Current research papers on the file format identification problem. Most of these papers concern themselves with identifying file format of a few file sectors, rather than an entire file.  '''Please note that this bibliography is in chronological order!'''
 +
  
 
;2001
 
;2001
 
Current research papers on the file format identification problem. Most of these papers concern themselves with identifying file format of a few file sectors, rather than an entire file.  '''Please note that this bibliography is in chronological order!'''
 
  
 
* Mason McDaniel, [[Media:Mcdaniel01.pdf|Automatic File Type Detection Algorithm]], Masters Thesis, James Madison University,2001
 
* Mason McDaniel, [[Media:Mcdaniel01.pdf|Automatic File Type Detection Algorithm]], Masters Thesis, James Madison University,2001
Line 49: Line 63:
 
; 2005
 
; 2005
  
* [http://www1.cs.columbia.edu/ids/publications/FilePrintPaper-revised.pdf Fileprints: identifying file types by n-gram analysis], LiWei-Jen, Wang Ke, Stolfo SJ, Herzog B..,  IProceeding of the 2005 IEEE workshop on information assurance; 2005 [http://www.itoc.usma.edu/workshop/2005/Papers/Follow%20ups/FilePrintPresentation-final.pdf [slides]]
+
* Fileprints: identifying file types by n-gram analysis, LiWei-Jen, Wang Ke, Stolfo SJ, Herzog B..,  IProceeding of the 2005 IEEE workshop on information assurance, 2005. ([http://www.itoc.usma.edu/workshop/2005/Papers/Follow%20ups/FilePrintPresentation-final.pdf Presentation Slides])  ([http://www1.cs.columbia.edu/ids/publications/FilePrintPaper-revised.pdf PDF])
  
* [http://www.micsymposium.org/mics_2005/papers/paper7.pdf File Type Detection Technology], Douglas J. Hickok, Daine Richard Lesniak, Michael C. Rowe, 2005 Midwest Instruction and Computing Symposium.
+
* Douglas J. Hickok, Daine Richard Lesniak, Michael C. Rowe, File Type Detection Technology,  2005 Midwest Instruction and Computing Symposium.([http://www.micsymposium.org/mics_2005/papers/paper7.pdf PDF])
  
 
; 2006
 
; 2006
  
* [http://ieeexplore.ieee.org/iel5/10992/34632/01652088.pdf  File type identification of data fragments by their binary structure. ], Karresand Martin, Shahmehri Nahid. Proceedings of the IEEE workshop on information assurance; 2006. p. 140–7. [http://www.itoc.usma.edu/workshop/2006/Program/Presentations/IAW2006-07-3.pdf [slides]]
+
* Karresand Martin, Shahmehri Nahid [http://ieeexplore.ieee.org/iel5/10992/34632/01652088.pdf  File type identification of data fragments by their binary structure. ], Proceedings of the IEEE workshop on information assurance, pp.140–147, 2006.([http://www.itoc.usma.edu/workshop/2006/Program/Presentations/IAW2006-07-3.pdf Presentation Slides])
  
* [http://www.mantechcfia.com/SlidingWindowMeasurementforFileTypeIdentification.pdf Sliding Window Measurement for File Type Identification], Gregory A. Hall, Ph.D., Computer Forensics and Intrusion Analysis Group, ManTech Security and Mission Assurance, 2006
+
* Gregory A. Hall, Sliding Window Measurement for File Type Identification, Computer Forensics and Intrusion Analysis Group, ManTech Security and Mission Assurance, 2006. ([http://www.mantechcfia.com/SlidingWindowMeasurementforFileTypeIdentification.pdf PDF])
  
 
* FORSIGS; Forensic Signature Analysis of the Hard Drive for Multimedia File Fingerprints, John Haggerty and Mark Taylor, IFIP TC11 International Information Security Conference, 2006, Sandton, South Africa.
 
* FORSIGS; Forensic Signature Analysis of the Hard Drive for Multimedia File Fingerprints, John Haggerty and Mark Taylor, IFIP TC11 International Information Security Conference, 2006, Sandton, South Africa.
  
* Oscar -- Using Byte Pairs to Find File Type and Camera Make of Data Fragments, Martin Karresand , Nahid Shahmehri, Annual Workshop on Digital Forensics and Incident Analysis ( 2006 : Pontypridd, Wales, UK ) , s. 85 - 94, London, UK : Springer-Verlag, 2006
+
* Martin Karresand , Nahid Shahmehri, "Oscar -- Using Byte Pairs to Find File Type and Camera Make of Data Fragments," Annual Workshop on Digital Forensics and Incident Analysis, Pontypridd, Wales, UK, pp.85-94, Springer-Verlag, 2006.
  
 
; 2007
 
; 2007
  
* Karresand M., Shahmehri N., [http://dx.doi.org/10.1007/0-387-33406-8_35 Oscar: File Type Identification of Binary Data in Disk Clusters and RAM Pages], Proceedings of IFIP International Information Security Conference: Security and Privacy in Dynamic Environments (SEC2006), Springer, ISBN 0-387-33405-x, pp 413-424, May 22 - 24, Karlstad, Sweden.
+
* Karresand M., Shahmehri N., [http://dx.doi.org/10.1007/0-387-33406-8_35 Oscar: File Type Identification of Binary Data in Disk Clusters and RAM Pages], Proceedings of IFIP International Information Security Conference: Security and Privacy in Dynamic Environments (SEC2006), Springer, ISBN 0-387-33405-x, pp.413-424, Karlstad, Sweden, May 2006.
  
* "Identification and Localization of Data Types within Large-Scale File Systems," Robert F. Erbacher and John Mulholland,, Proceedings of the 2nd International Workshop on Systematic Approaches to Digital Forensic Engineering, Seattle, WA, April 2007,
+
* Robert F. Erbacher and John Mulholland, "Identification and Localization of Data Types within Large-Scale File Systems," Proceedings of the 2nd International Workshop on Systematic Approaches to Digital Forensic Engineering, Seattle, WA, April 2007.
  
* [https://www.cerias.purdue.edu/tools_and_resources/bibtex_archive/archive/2007-19.pdf Using Artificial Neural Networks for Forensic File Type Identification], Ryan M. Harris, Master's Thesis, Purdue University, May 2007
+
* Ryan M. Harris, "Using Artificial Neural Networks for Forensic File Type Identification," Master's Thesis, Purdue University, May 2007. ([https://www.cerias.purdue.edu/tools_and_resources/bibtex_archive/archive/2007-19.pdf PDF])
  
* [http://www.dfrws.org/2008/proceedings/p14-calhoun.pdf Predicting the Types of File Fragments], William Calhoun, Drue Coles, DFRWS 2008 [http://www.dfrws.org/2008/proceedings/p14-calhoun_pres.pdf [slides]]
+
* Predicting the Types of File Fragments, William Calhoun, Drue Coles, DFRWS 2008. ([http://www.dfrws.org/2008/proceedings/p14-calhoun_pres.pdf Presentation Slides])  ([http://www.dfrws.org/2008/proceedings/p14-calhoun.pdf PDF])
  
* [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=04545366 SÁDI – Statistical Analysis for Data type Identification], Sarah J. Moody and Robert F. Erbacher, 3rd International Workshop on Systematic Approaches to Digital Forensic Engineering, Third International Workshop on Systematic Approaches to Digital Forensic Engineering, 2008]
+
* Sarah J. Moody and Robert F. Erbacher, [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=04545366 SÁDI – Statistical Analysis for Data type Identification], 3rd International Workshop on Systematic Approaches to Digital Forensic Engineering, 2008.
  
 
; 2008
 
; 2008
  
* Mehdi Chehel Amirani, Mohsen Toorani, and Ali Asghar Beheshti Shirazi, [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4625611 A New Approach to Content-based File Type Detection], Proceedings of the 13th IEEE Symposium on Computers and Communications (ISCC'08), pp.1103-1108, IEEE ComSoc, Marrakech, Morocco, July 2008.
+
* Mehdi Chehel Amirani, Mohsen Toorani, and Ali Asghar Beheshti Shirazi, [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4625611 A New Approach to Content-based File Type Detection], Proceedings of the 13th IEEE Symposium on Computers and Communications (ISCC'08), pp.1103-1108, July 2008. ([http://arxiv.org/ftp/arxiv/papers/1002/1002.3174.pdf PDF])
  
 
; 2009
 
; 2009
*Roussev, Vassil, and Garfinkel, Simson, [http://simson.net/clips/academic/2009.SADFE.Fragments.pdf File Classification Fragment---The Case for Specialized Approaches], Systematic Approaches to Digital Forensics Engineering (IEEE/SADFE 2009), Oakland, California.
+
* Roussev, Vassil, and Garfinkel, Simson, "File Classification Fragment-The Case for Specialized Approaches," Systematic Approaches to Digital Forensics Engineering (IEEE/SADFE 2009), Oakland, California. ([http://simson.net/clips/academic/2009.SADFE.Fragments.pdf PDF])
 +
 
 +
* Irfan Ahmed, Kyung-suk Lhee, Hyunjung Shin and ManPyo Hong, [http://www.springerlink.com/content/g2655k2044615q75/ On Improving the Accuracy and Performance of Content-based File Type Identification], Proceedings of the 14th Australasian Conference on Information Security and Privacy (ACISP 2009), pp.44-59, LNCS (Springer), Brisbane, Australia, July 2009.
  
*Irfan Ahmed, Kyung-suk Lhee, Hyunjung Shin and ManPyo Hong, [http://www.springerlink.com/content/g2655k2044615q75/ On improving the accuracy and performance of content-based file type identification], Proceedings of the 14th Australasian Conference on Information Security and Privacy (ACISP 2009), pp.44-59, LNCS (Springer), Brisbane, Australia, July 2009
+
; 2010
 +
*Irfan Ahmed, Kyung-suk Lhee, Hyunjung Shin and ManPyo Hong, [http://www.alphaminers.net/sub05/sub05_03.php?swf_pn=5&swf_sn=3&swf_pn2=3    Fast File-type Identification], Proceedings of the 25th ACM Symposium on Applied Computing (ACM SAC 2010), ACM, Sierre, Switzerland, March 2010.
  
 +
;2011
 +
*Irfan Ahmed, Kyung-Suk Lhee, Hyun-Jung Shin, Man-Pyo Hong, [http://link.springer.com/chapter/10.1007/978-3-642-24212-0_5 Fast Content-Based File Type Identification], Proceedings of the 7th Annual IFIP WG 11.9 International Conference on Digital Forensics, Orlando, FL, USA, February, 2011
 
[[Category:Bibliographies]]
 
[[Category:Bibliographies]]

Latest revision as of 12:06, 17 April 2013

File Format Identification is the process of figuring out the format of a sequence of bytes. Operating systems typically do this by file extension or by embedded MIME information. Forensic applications need to identify file types by content.
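
As a minimal sketch of identification by content, the Python below compares the leading bytes of a buffer against a handful of well-known magic numbers. The signature table and the `identify` function name are assumptions made for illustration and are not taken from any tool listed on this page; real tools use far larger rule sets and checks beyond fixed prefixes.

```python
# Illustrative signature table: (magic prefix, MIME-style label).
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"PK\x03\x04", "application/zip"),
]


def identify(data: bytes) -> str:
    """Return a label for data based only on its leading bytes."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    # No signature matched: report the generic "unknown binary" type.
    return "application/octet-stream"


if __name__ == "__main__":
    # The file name or extension plays no part in the decision.
    print(identify(b"%PDF-1.4\n"))
```

Because only the bytes are inspected, a PDF renamed to `report.txt` would still be labelled as a PDF, which is the behaviour forensic use requires.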


Tools

libmagic

  • Written in C.
  • Rules in /usr/share/file/magic and compiled at runtime.
  • Powers the Unix “file” command, but you can also call the library directly from a C program.
  • http://sourceforge.net/projects/libmagic
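
The list above mentions calling the library from C. As a lighter-weight sketch, the Python below shells out to the `file` command instead. It assumes a `file` binary is on the PATH and that it accepts the `--brief` and `--mime-type` options; the function name `mime_type_via_file` is hypothetical.

```python
import shutil
import subprocess
from typing import Optional


def mime_type_via_file(path: str) -> Optional[str]:
    """Ask the Unix file command for the MIME type of path.

    Returns None when the file command is not installed or exits
    with a non-zero status.
    """
    if shutil.which("file") is None:
        return None
    result = subprocess.run(
        ["file", "--brief", "--mime-type", path],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    print(mime_type_via_file("/etc/hostname"))
```

Spawning a process per file is slow for large corpora; for bulk work, calling the library directly as the list describes avoids that overhead.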

Digital Preservation Efforts

PRONOM is a project of the National Archives of the United Kingdom to develop a registry of file types. A similar project was started by JSTOR and Harvard as the JSTOR/Harvard Object Validation Environment (JHOVE). Efforts are now under way to merge these two projects in the Global Digital Format Registry and the Universal Digital Format Registry.

The UK National Archives developed the Digital Record Object Identification (DROID) tool, an "automatic file format identification tool." This tool is written in Java and can be downloaded from SourceForge.

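See:

  • PRONOM: http://www.nationalarchives.gov.uk/PRONOM/Default.aspx
  • JHOVE: http://hul.harvard.edu/jhove/
  • JHOVE2: https://wiki.ucop.edu/display/JHOVE2Info/Home
  • GDFR: http://www.gdfr.info/
  • UDFR: http://www.udfr.org/
  • DROID download: http://droid.sourceforge.net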

TrID

Forensic Innovations File Investigator TOOLS

  • Proprietary, but free trial available.
  • Available as consumer applications and OEM API.
  • Identifies 3,000+ file types, using multiple methods to maintain high accuracy.
  • Extracts metadata for many of the supported file types.
  • http://www.forensicinnovations.com/fitools.html

Stellent/Oracle Outside-In

Forensic Assistant

  • Proprietary.
  • Provides detection of password protected archives, some files of cryptographic programs, Pinch/Zeus binary reports, etc.
  • http://nhtcu.ru/0xFA_eng.html

Data Sets

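If you are working in the field of file format identification, please consider reporting the results of your algorithm on one of these publicly available data sets:

  • NPS govdocs1m: a corpus of 1 million files that can be redistributed without concern for copyright or PII. Download from http://domex.nps.edu/corp/files/govdocs1/
  • The NPS Disk Corpus: a corpus of realistic disk images that contain no PII. Information is at http://digitalcorpora.org/?s=nps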

Bibliography

Current research papers on the file format identification problem. Most of these papers are concerned with identifying the format of a few file sectors rather than of an entire file. Please note that this bibliography is in chronological order!
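
Classifying a lone sector means working without headers or file names, so byte-level statistics are a common starting point. The Python below is a minimal sketch of that idea: it builds a normalised byte histogram for a fragment, computes its Shannon entropy, and picks the closest reference histogram. It is not the method of any specific paper listed here; the function names, the `centroids` dictionary, and the use of Manhattan distance are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def byte_histogram(fragment: bytes) -> list:
    """Normalised frequency of each byte value (0-255) in fragment."""
    counts = Counter(fragment)
    n = len(fragment)
    return [counts.get(b, 0) / n if n else 0.0 for b in range(256)]


def entropy(fragment: bytes) -> float:
    """Shannon entropy of fragment in bits per byte."""
    return -sum(p * math.log2(p) for p in byte_histogram(fragment) if p > 0)


def nearest_type(fragment: bytes, centroids: dict) -> str:
    """Return the label whose reference histogram is closest to fragment.

    centroids maps a label to a 256-entry reference histogram;
    closeness is Manhattan distance, a deliberately simple stand-in.
    """
    hist = byte_histogram(fragment)
    return min(
        centroids,
        key=lambda label: sum(abs(a - b) for a, b in zip(hist, centroids[label])),
    )


if __name__ == "__main__":
    # Toy reference histograms built from synthetic data.
    centroids = {
        "text": byte_histogram(b"the quick brown fox " * 25),
        "binary": byte_histogram(bytes(range(256)) * 2),
    }
    sector = b"a lazy dog jumps over " * 20
    print(nearest_type(sector, centroids), round(entropy(sector), 2))
```

In practice the reference histograms would be trained on labelled fragments from a corpus, and compressed or encrypted types, which all look near-uniform, would need features beyond a single-byte histogram.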


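2001
  • Mason McDaniel, Automatic File Type Detection Algorithm, Masters Thesis, James Madison University, 2001.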
2003
2005
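  • Fileprints: identifying file types by n-gram analysis, Li Wei-Jen, Wang Ke, Stolfo SJ, Herzog B., Proceedings of the 2005 IEEE Workshop on Information Assurance, 2005. (Presentation Slides: http://www.itoc.usma.edu/workshop/2005/Papers/Follow%20ups/FilePrintPresentation-final.pdf) (PDF: http://www1.cs.columbia.edu/ids/publications/FilePrintPaper-revised.pdf)
  • Douglas J. Hickok, Daine Richard Lesniak, Michael C. Rowe, File Type Detection Technology, 2005 Midwest Instruction and Computing Symposium. (PDF: http://www.micsymposium.org/mics_2005/papers/paper7.pdf)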
2006
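  • Karresand Martin, Shahmehri Nahid, File type identification of data fragments by their binary structure, Proceedings of the IEEE workshop on information assurance, pp.140–147, 2006. (Presentation Slides: http://www.itoc.usma.edu/workshop/2006/Program/Presentations/IAW2006-07-3.pdf) (http://ieeexplore.ieee.org/iel5/10992/34632/01652088.pdf)
  • Gregory A. Hall, Sliding Window Measurement for File Type Identification, Computer Forensics and Intrusion Analysis Group, ManTech Security and Mission Assurance, 2006. (PDF: http://www.mantechcfia.com/SlidingWindowMeasurementforFileTypeIdentification.pdf)
  • FORSIGS: Forensic Signature Analysis of the Hard Drive for Multimedia File Fingerprints, John Haggerty and Mark Taylor, IFIP TC11 International Information Security Conference, 2006, Sandton, South Africa.
  • Martin Karresand, Nahid Shahmehri, "Oscar -- Using Byte Pairs to Find File Type and Camera Make of Data Fragments," Annual Workshop on Digital Forensics and Incident Analysis, Pontypridd, Wales, UK, pp.85-94, Springer-Verlag, 2006.
  • Karresand M., Shahmehri N., Oscar: File Type Identification of Binary Data in Disk Clusters and RAM Pages, Proceedings of IFIP International Information Security Conference: Security and Privacy in Dynamic Environments (SEC2006), Springer, ISBN 0-387-33405-x, pp.413-424, Karlstad, Sweden, May 2006. (http://dx.doi.org/10.1007/0-387-33406-8_35)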
2007
  • Robert F. Erbacher and John Mulholland, "Identification and Localization of Data Types within Large-Scale File Systems," Proceedings of the 2nd International Workshop on Systematic Approaches to Digital Forensic Engineering, Seattle, WA, April 2007.
  • Ryan M. Harris, "Using Artificial Neural Networks for Forensic File Type Identification," Master's Thesis, Purdue University, May 2007. (PDF: https://www.cerias.purdue.edu/tools_and_resources/bibtex_archive/archive/2007-19.pdf)
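2008
  • Predicting the Types of File Fragments, William Calhoun, Drue Coles, DFRWS 2008. (Presentation Slides: http://www.dfrws.org/2008/proceedings/p14-calhoun_pres.pdf) (PDF: http://www.dfrws.org/2008/proceedings/p14-calhoun.pdf)
  • Sarah J. Moody and Robert F. Erbacher, SÁDI – Statistical Analysis for Data type Identification, 3rd International Workshop on Systematic Approaches to Digital Forensic Engineering, 2008. (http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=04545366)
  • Mehdi Chehel Amirani, Mohsen Toorani, and Ali Asghar Beheshti Shirazi, A New Approach to Content-based File Type Detection, Proceedings of the 13th IEEE Symposium on Computers and Communications (ISCC'08), pp.1103-1108, July 2008. (PDF: http://arxiv.org/ftp/arxiv/papers/1002/1002.3174.pdf)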
2009
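  • Roussev, Vassil, and Garfinkel, Simson, "File Fragment Classification - The Case for Specialized Approaches," Systematic Approaches to Digital Forensics Engineering (IEEE/SADFE 2009), Oakland, California. (PDF: http://simson.net/clips/academic/2009.SADFE.Fragments.pdf)
  • Irfan Ahmed, Kyung-suk Lhee, Hyunjung Shin and ManPyo Hong, On Improving the Accuracy and Performance of Content-based File Type Identification, Proceedings of the 14th Australasian Conference on Information Security and Privacy (ACISP 2009), pp.44-59, LNCS (Springer), Brisbane, Australia, July 2009. (http://www.springerlink.com/content/g2655k2044615q75/)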
2010
  • Irfan Ahmed, Kyung-suk Lhee, Hyunjung Shin and ManPyo Hong, Fast File-type Identification, Proceedings of the 25th ACM Symposium on Applied Computing (ACM SAC 2010), ACM, Sierre, Switzerland, March 2010.
2011
  • Irfan Ahmed, Kyung-Suk Lhee, Hyun-Jung Shin, Man-Pyo Hong, Fast Content-Based File Type Identification, Proceedings of the 7th Annual IFIP WG 11.9 International Conference on Digital Forensics, Orlando, FL, USA, February 2011. (http://link.springer.com/chapter/10.1007/978-3-642-24212-0_5)