Difference between pages "Bulk extractor" and "File Format Identification"

== Overview ==
'''bulk_extractor''' is a computer forensics tool that scans a disk image, a file, or a directory of files and extracts useful information without parsing the file system or file system structures. The results can be easily inspected, parsed, or processed with automated tools. '''bulk_extractor''' also creates histograms of the features that it finds, because features that are more common tend to be more important. The program can be used for law enforcement, defense, intelligence, and cyber-investigation applications.
  
bulk_extractor is distinguished from other forensic tools by its speed and thoroughness. Because it ignores file system structure, bulk_extractor can process different parts of the disk in parallel. In practice, the program splits the disk into 16 MiB pages and processes one page on each available core, so a 24-core machine processes a disk roughly 24 times faster than a 1-core machine. bulk_extractor is also thorough: it automatically detects, decompresses, and recursively re-processes data that has been compressed with a variety of algorithms. Our testing has shown that a significant amount of the compressed data in the unallocated regions of file systems is missed by most forensic tools in common use today.
 
  
Another advantage of ignoring file systems is that bulk_extractor can be used to process any digital media. We have used the program to process hard drives, SSDs, optical media, camera cards, cell phones, network packet dumps, and other kinds of digital information.
  
==Output Feature Files==
  
bulk_extractor now creates an output directory that includes:
* '''ccn.txt''' -- Credit card numbers
* '''ccn_track2.txt''' -- Credit card "track 2" information
* '''domain.txt''' -- Internet domains found on the drive, including dotted-quad addresses found in text.
* '''email.txt''' -- Email addresses
* '''ether.txt''' -- Ethernet MAC addresses found through IP packet carving of swap files, compressed system hibernation files, and file fragments.
* '''exif.txt''' -- EXIF records from JPEGs and video segments. This feature file contains all of the EXIF fields, expanded as XML records.
* '''find.txt''' -- The results of specific regular expression search requests.
* '''ip.txt''' -- IP addresses found through IP packet carving.
* '''telephone.txt''' -- US and international telephone numbers.
* '''url.txt''' -- URLs, typically found in browser caches, email messages, and pre-compiled into executables.
* '''url_searches.txt''' -- A histogram of terms used in Internet searches from services such as Google, Bing, and Yahoo.
* '''wordlist.txt''' -- A list of all "words" extracted from the disk, useful for password cracking.
* '''wordlist_*.txt''' -- The wordlist with duplicates removed, formatted in a form that can be easily imported into a popular password-cracking program.
* '''zip.txt''' -- Information about every ZIP file component found on the media. This is exceptionally useful because ZIP files contain internal structure and ZIP is increasingly the compound file format of choice for a variety of products such as Microsoft Office.
  
For each of the above, two additional files may be created:
* '''*_stopped.txt''' -- bulk_extractor supports a stop list: a list of items that do not need to be brought to the user's attention. Rather than simply suppressing this information, which might cause something critical to be hidden, stopped entries are stored in the stopped files.
* '''*_histogram.txt''' -- bulk_extractor can also create histograms of features. This is important because experience has shown that email addresses, domain names, URLs, and other information that appear more frequently on a hard drive or in a cell phone's memory can be used to rapidly create a pattern-of-life report. A minimal sketch of parsing a feature file into a histogram appears below.
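
The feature files are plain text, so they are easy to post-process with short scripts. The following is a minimal sketch, not part of the bulk_extractor distribution, that builds a histogram from a feature file such as email.txt; the exact column layout (tab-separated offset, feature, context) and the "#" comment convention are assumptions based on the description above.

<pre>
#!/usr/bin/env python
"""Minimal sketch: build a histogram from a bulk_extractor feature file.

Assumes each non-comment line is tab-separated: offset, feature, context.
"""
import collections
import sys

def feature_histogram(path):
    counts = collections.Counter()
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            if not line.strip() or line.startswith("#"):
                continue                      # skip blank and comment/header lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                counts[fields[1]] += 1        # fields[1] is the feature itself
    return counts

if __name__ == "__main__":
    for feature, n in feature_histogram(sys.argv[1]).most_common(20):
        print(f"{n}\t{feature}")
</pre>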
  
Bulk extractor also creates a file that captures the provenance of the run:
;report.xml
:A Digital Forensics XML report that includes information about the source media, how the bulk_extractor program was compiled and run, the time it took to process the digital evidence, and a meta-report of the information that was found.
  
==Post-Processing==
  
We have developed four programs for post-processing the bulk_extractor output:
;bulk_diff.py
:This program reports the differences between two bulk_extractor runs. The intent is to image a computer, run bulk_extractor on the disk image, let the computer run for a period of time, re-image the computer, run bulk_extractor on the second image, and then report the differences. This can be used to infer the user's activities within a time period. A minimal sketch of the comparison appears after this list.
;cda_tool.py
:This tool, currently under development, reads bulk_extractor reports from multiple runs against multiple drives and performs a multi-drive correlation using Garfinkel's Cross Drive Analysis technique. This can be used to automatically identify new social networks or to identify new members of existing networks.
;identify_filenames.py
:In a bulk_extractor feature file, each feature is annotated with the byte offset from the beginning of the image at which it was found. This program takes as input a bulk_extractor feature file and a DFXML file containing the locations of each file on the drive (produced with Garfinkel's fiwalk program) and produces an annotated feature file that contains the offset, the feature, and the file in which the feature was found.
;make_context_stop_list.py
:Although forensic analysts frequently make "stop lists" (for example, a list of email addresses that appear in the operating system and should therefore be ignored), such lists have a significant problem. Because it is relatively easy to get an email address into the binary of an open source application, ignoring all of these email addresses may make it possible to cloak email addresses from forensic analysis. Our solution is to create context-sensitive stop lists, in which the feature to be stopped is presented with the context in which it occurs. The make_context_stop_list.py program takes the results of multiple bulk_extractor runs and creates a single context-sensitive stop list that can then be used to suppress features when they are found in a specific context. One such stop list, constructed from Windows and Linux operating systems, is available on the bulk_extractor website.
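
The following is a minimal sketch of the kind of comparison bulk_diff.py performs: it treats each feature file as a set of feature strings and reports features that appear in the second run but not the first. It illustrates the idea only; the real bulk_diff.py also handles histograms, multiple feature files, and report formatting.

<pre>
#!/usr/bin/env python
"""Minimal sketch of a bulk_diff-style comparison between two feature files."""
import sys

def load_features(path):
    """Return the set of features (second tab-separated column) in a feature file."""
    features = set()
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for line in f:
            if not line.strip() or line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                features.add(fields[1])
    return features

if __name__ == "__main__":
    before, after = load_features(sys.argv[1]), load_features(sys.argv[2])
    for feature in sorted(after - before):
        print("new:", feature)
</pre>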
  
== Download ==
The current version of '''bulk_extractor''' is 1.3.

* Downloads are available at: http://digitalcorpora.org/downloads/bulk_extractor/
* A Windows installer with the GUI can be downloaded from: http://digitalcorpora.org/downloads/bulk_extractor/executables/be_installer-1.3.exe
  
== Bibliography ==
=== Academic Publications ===
# Garfinkel, Simson, [http://simson.net/clips/academic/2013.COSE.bulk_extractor.pdf Digital media triage with bulk data analysis and bulk_extractor], Computers and Security 32: 56-72 (2013)
# Beverly, Robert, Simson Garfinkel and Greg Cardwell, [http://simson.net/clips/academic/2011.DFRWS.ipcarving.pdf "Forensic Carving of Network Packets and Associated Data Structures"], DFRWS 2011, Aug. 1-3, 2011, New Orleans, LA. BEST PAPER AWARD (Acceptance rate: 23%, 14/62)
# Garfinkel, S., [http://simson.net/clips/academic/2006.DFRWS.pdf Forensic Feature Extraction and Cross-Drive Analysis], The 6th Annual Digital Forensic Research Workshop, Lafayette, Indiana, August 14-16, 2006. (Acceptance rate: 43%, 16/37)
  
===YouTube===
 
'''[http://www.youtube.com/results?search_query=bulk_extractor search YouTube] for bulk_extractor videos'''
 
* [http://www.youtube.com/watch?v=odvDTGA7rYI Simson Garfinkel speaking at CERIAS about bulk_extractor]
 
* [http://www.youtube.com/watch?v=wTBHM9DeLq4 BackTrack 5 with bulk_extractor]
 
* [http://www.youtube.com/watch?v=QVfYOvhrugg Ubuntu 12.04 forensics with bulk_extractor]
 
* [http://www.youtube.com/watch?v=57RWdYhNvq8 Social Network forensics with bulk_extractor]
 
  
===Tutorials===
# [http://simson.net/ref/2012/2012-08-08%20bulk_extractor%20Tutorial.pdf Using bulk_extractor for digital forensics triage and cross-drive analysis], DFRWS 2012


File Format Identification is the process of figuring out the format of a sequence of bytes. Operating systems typically do this by file extension or by embedded MIME information. Forensic applications need to identify file types by content.
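
As a concrete illustration of content-based identification, the sketch below checks the leading bytes of a file against a few well-known signatures. The signature table here is illustrative only, not a complete or authoritative list; real tools use far larger rule sets.

<pre>
#!/usr/bin/env python
"""Minimal sketch: identify a file's format from its leading bytes (magic numbers)."""
import sys

# A few well-known signatures; tools such as libmagic or DROID maintain much larger databases.
SIGNATURES = [
    (b"%PDF-",             "PDF document"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff",      "JPEG image"),
    (b"PK\x03\x04",        "ZIP archive (also DOCX/XLSX/ODF/JAR containers)"),
    (b"GIF87a",            "GIF image"),
    (b"GIF89a",            "GIF image"),
]

def identify(path):
    with open(path, "rb") as f:
        header = f.read(16)
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown (no matching signature)"

if __name__ == "__main__":
    print(identify(sys.argv[1]))
</pre>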
 
=Tools=

==libmagic==
* Written in C.
* Rules live in /usr/share/file/magic and are compiled at runtime.
* Powers the Unix "file" command, but the library can also be called directly from a C program (or through a foreign-function interface, as in the sketch below).
* http://sourceforge.net/projects/libmagic
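
libmagic can be loaded directly rather than by shelling out to file(1). The following ctypes sketch assumes a Linux system where the shared library is named libmagic.so.1; the library name would need checking on other platforms.

<pre>
#!/usr/bin/env python
"""Minimal sketch: call libmagic directly via ctypes (assumes libmagic.so.1 on Linux)."""
import ctypes
import sys

MAGIC_MIME_TYPE = 0x10          # report the MIME type instead of a textual description

libmagic = ctypes.CDLL("libmagic.so.1")
libmagic.magic_open.restype = ctypes.c_void_p
libmagic.magic_open.argtypes = [ctypes.c_int]
libmagic.magic_load.argtypes = [ctypes.c_void_p, ctypes.c_char_p]
libmagic.magic_file.restype = ctypes.c_char_p
libmagic.magic_file.argtypes = [ctypes.c_void_p, ctypes.c_char_p]
libmagic.magic_close.argtypes = [ctypes.c_void_p]

def mime_type(path):
    cookie = libmagic.magic_open(MAGIC_MIME_TYPE)
    try:
        libmagic.magic_load(cookie, None)               # load the default magic database
        result = libmagic.magic_file(cookie, path.encode())
        return result.decode() if result else "unknown"
    finally:
        libmagic.magic_close(cookie)

if __name__ == "__main__":
    print(mime_type(sys.argv[1]))
</pre>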

==Digital Preservation Efforts==
PRONOM is a project of the National Archives of the United Kingdom to develop a registry of file types. A similar project was started by JSTOR and Harvard as the JSTOR/Harvard Object Validation Environment (JHOVE). Attempts are now underway to merge these two efforts in the Global Digital Format Registry and the Universal Digital Format Registry.

The UK National Archives also developed the Digital Record Object Identification (DROID) tool, an "automatic file format identification tool." DROID is written in Java and can be downloaded from SourceForge.

See:
* [http://www.nationalarchives.gov.uk/PRONOM/Default.aspx PRONOM]
* [http://hul.harvard.edu/jhove/ JHOVE]
* [https://wiki.ucop.edu/display/JHOVE2Info/Home JHOVE2]
* [http://www.gdfr.info/ GDFR]
* [http://www.udfr.org/ UDFR]
* [http://droid.sourceforge.net DROID download]

==TrID==
* XML config file
* Closed source; free for non-commercial use.
* http://mark0.net/soft-trid-e.html

==Forensic Innovations File Investigator TOOLS==
* Proprietary, but a free trial is available.
* Available as consumer applications and as an OEM API.
* Identifies 3,000+ file types, using multiple methods to maintain high accuracy.
* Extracts metadata for many of the supported file types.
* http://www.forensicinnovations.com/fitools.html

==Stellent/Oracle Outside-In==
* Proprietary, but a free demo is available.
* http://www.oracle.com/technology/products/content-management/oit/oit_all.html

==[[Forensic Assistant]]==
* Proprietary.
* Detects password-protected archives, some files belonging to cryptographic programs, Pinch/Zeus binary reports, etc.
* http://nhtcu.ru/0xFA_eng.html

[[Category:Tools]]

=Data Sets=
If you are working in the field of file format identification, please consider reporting the results of your algorithm on one of these publicly available data sets:

* NPS govdocs1m - a corpus of 1 million files that can be redistributed without concern for copyright or PII. Download from http://domex.nps.edu/corp/files/govdocs1/
* The NPS Disk Corpus - a corpus of realistic disk images that contain no PII. Information is at: http://digitalcorpora.org/?s=nps

=Bibliography=
Current research papers on the file format identification problem. Most of these papers concern themselves with identifying the file format of a few file sectors, rather than an entire file. '''Please note that this bibliography is in chronological order.'''
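
Many of the papers below classify fragments using statistics over the fragment's bytes (n-gram or byte-frequency profiles) rather than fixed signatures. The sketch below is a toy illustration of that general idea, assuming the caller supplies labelled training fragments; it is not an implementation of any particular paper.

<pre>
#!/usr/bin/env python
"""Toy byte-frequency ("fileprint"-style) fragment classifier.

Builds a 256-bin byte histogram per labelled training fragment, averages them
into one centroid per file type, and labels new fragments by cosine similarity.
"""
import math
from collections import defaultdict

def byte_histogram(data: bytes):
    hist = [0.0] * 256
    for b in data:
        hist[b] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]          # normalized byte frequencies

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def train(labelled_fragments):
    """labelled_fragments: iterable of (label, bytes). Returns {label: centroid}."""
    sums, counts = defaultdict(lambda: [0.0] * 256), defaultdict(int)
    for label, data in labelled_fragments:
        hist = byte_histogram(data)
        sums[label] = [s + h for s, h in zip(sums[label], hist)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}

def classify(data, centroids):
    hist = byte_histogram(data)
    return max(centroids, key=lambda label: cosine(hist, centroids[label]))
</pre>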
 
;2001
* Mason McDaniel, [[Media:Mcdaniel01.pdf|Automatic File Type Detection Algorithm]], Masters Thesis, James Madison University, 2001.

;2003
* Mason McDaniel and M. Hossain Heydari, [http://www2.computer.org/portal/web/csdl/abs/proceedings/hicss/2003/1874/09/187490332a.pdf Content Based File Type Detection Algorithms], 36th Annual Hawaii International Conference on System Sciences (HICSS'03) - Track 9, 2003.

;2005
* Wei-Jen Li, Ke Wang, Salvatore J. Stolfo and Benjamin Herzog, Fileprints: identifying file types by n-gram analysis, Proceedings of the 2005 IEEE Workshop on Information Assurance, 2005. ([http://www.itoc.usma.edu/workshop/2005/Papers/Follow%20ups/FilePrintPresentation-final.pdf Presentation Slides]) ([http://www1.cs.columbia.edu/ids/publications/FilePrintPaper-revised.pdf PDF])
* Douglas J. Hickok, Daine Richard Lesniak and Michael C. Rowe, File Type Detection Technology, 2005 Midwest Instruction and Computing Symposium. ([http://www.micsymposium.org/mics_2005/papers/paper7.pdf PDF])

;2006
* Martin Karresand and Nahid Shahmehri, [http://ieeexplore.ieee.org/iel5/10992/34632/01652088.pdf File type identification of data fragments by their binary structure], Proceedings of the IEEE Workshop on Information Assurance, pp. 140-147, 2006. ([http://www.itoc.usma.edu/workshop/2006/Program/Presentations/IAW2006-07-3.pdf Presentation Slides])
* Gregory A. Hall, Sliding Window Measurement for File Type Identification, Computer Forensics and Intrusion Analysis Group, ManTech Security and Mission Assurance, 2006. ([http://www.mantechcfia.com/SlidingWindowMeasurementforFileTypeIdentification.pdf PDF])
* John Haggerty and Mark Taylor, FORSIGS: Forensic Signature Analysis of the Hard Drive for Multimedia File Fingerprints, IFIP TC11 International Information Security Conference, 2006, Sandton, South Africa.
* Martin Karresand and Nahid Shahmehri, "Oscar -- Using Byte Pairs to Find File Type and Camera Make of Data Fragments," Annual Workshop on Digital Forensics and Incident Analysis, Pontypridd, Wales, UK, pp. 85-94, Springer-Verlag, 2006.
* Martin Karresand and Nahid Shahmehri, [http://dx.doi.org/10.1007/0-387-33406-8_35 Oscar: File Type Identification of Binary Data in Disk Clusters and RAM Pages], Proceedings of the IFIP International Information Security Conference: Security and Privacy in Dynamic Environments (SEC 2006), Springer, ISBN 0-387-33405-X, pp. 413-424, Karlstad, Sweden, May 2006.

;2007
* Robert F. Erbacher and John Mulholland, "Identification and Localization of Data Types within Large-Scale File Systems," Proceedings of the 2nd International Workshop on Systematic Approaches to Digital Forensic Engineering, Seattle, WA, April 2007.
* Ryan M. Harris, "Using Artificial Neural Networks for Forensic File Type Identification," Master's Thesis, Purdue University, May 2007. ([https://www.cerias.purdue.edu/tools_and_resources/bibtex_archive/archive/2007-19.pdf PDF])

;2008
* William Calhoun and Drue Coles, Predicting the Types of File Fragments, DFRWS 2008. ([http://www.dfrws.org/2008/proceedings/p14-calhoun_pres.pdf Presentation Slides]) ([http://www.dfrws.org/2008/proceedings/p14-calhoun.pdf PDF])
* Sarah J. Moody and Robert F. Erbacher, [http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=04545366 SÁDI – Statistical Analysis for Data type Identification], 3rd International Workshop on Systematic Approaches to Digital Forensic Engineering, 2008.
* Mehdi Chehel Amirani, Mohsen Toorani and Ali Asghar Beheshti Shirazi, [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4625611 A New Approach to Content-based File Type Detection], Proceedings of the 13th IEEE Symposium on Computers and Communications (ISCC'08), pp. 1103-1108, July 2008. ([http://arxiv.org/ftp/arxiv/papers/1002/1002.3174.pdf PDF])

;2009
* Vassil Roussev and Simson Garfinkel, "File Fragment Classification -- The Case for Specialized Approaches," Systematic Approaches to Digital Forensics Engineering (IEEE/SADFE 2009), Oakland, California. ([http://simson.net/clips/academic/2009.SADFE.Fragments.pdf PDF])
* Irfan Ahmed, Kyung-suk Lhee, Hyunjung Shin and ManPyo Hong, [http://www.springerlink.com/content/g2655k2044615q75/ On Improving the Accuracy and Performance of Content-based File Type Identification], Proceedings of the 14th Australasian Conference on Information Security and Privacy (ACISP 2009), pp. 44-59, LNCS (Springer), Brisbane, Australia, July 2009.

;2010
* Irfan Ahmed, Kyung-suk Lhee, Hyunjung Shin and ManPyo Hong, [http://www.alphaminers.net/sub05/sub05_03.php?swf_pn=5&swf_sn=3&swf_pn2=3 Fast File-type Identification], Proceedings of the 25th ACM Symposium on Applied Computing (ACM SAC 2010), ACM, Sierre, Switzerland, March 2010.

;2011
* Irfan Ahmed, Kyung-Suk Lhee, Hyun-Jung Shin and Man-Pyo Hong, [http://link.springer.com/chapter/10.1007/978-3-642-24212-0_5 Fast Content-Based File Type Identification], Proceedings of the 7th Annual IFIP WG 11.9 International Conference on Digital Forensics, Orlando, FL, USA, February 2011.

[[Category:Bibliographies]]
