Difference between pages "AFF" and "Bulk extractor"

From Forensics Wiki
The '''Advanced Forensics Format''' ('''AFF''') is an extensible open format for the storage of [[disk image]]s and related forensic [[metadata]]. It was originally developed by [[Simson Garfinkel]] and [[Basis Technology]]. The latest version of AFF is implemented in the [[AFFLIBv3]] library, which can be found on [https://github.com/simsong/AFFLIBv3 github].  [[AFF4]] builds upon many of the concepts developed in AFF.  AFF4 was developed by [[Michael Cohen]], Simson Garfinkel and Bradley Schatz. AFF4 can be downloaded from [https://code.google.com/p/aff4/ Google Code].
== Overview ==

'''bulk_extractor''' is a computer forensics tool that scans a disk image, a file, or a directory of files and extracts useful information without parsing the file system or file system structures. The results can be easily inspected, parsed, or processed with automated tools. '''bulk_extractor''' also creates histograms of the features it finds, since features that are more common tend to be more important. The program can be used for law enforcement, defense, intelligence, and cyber-investigation applications.
  
[[Sleuthkit]], [[Autopsy]], [[OSFMount]], [[Xmount]], [[FTK Imager]] and [[FTK]] support the AFFv3 image format.

bulk_extractor is distinguished from other forensic tools by its speed and thoroughness. Because it ignores file system structure, bulk_extractor can process different parts of the disk in parallel. In practice, the program splits the disk into 16 MiB pages and processes one page on each available core, so a 24-core machine processes a disk roughly 24 times faster than a single-core machine. bulk_extractor is also thorough: it automatically detects, decompresses, and recursively re-processes data that has been compressed with a variety of algorithms. Our testing has shown that the unallocated regions of file systems contain a significant amount of compressed data that is missed by most forensic tools in common use today.
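The page-based parallelism described above can be sketched in a few lines. This is an illustration of the idea only, not bulk_extractor's actual code: the 16 MiB page size comes from the text, but the scanner (a simple email regular expression) and all function names are assumptions made for the example.

```python
import re
from concurrent.futures import ProcessPoolExecutor

PAGE_SIZE = 16 * 1024 * 1024  # 16 MiB pages, as described above
EMAIL_RE = re.compile(rb"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def scan_page(args):
    """Scan one page for email-like features; return (absolute_offset,
    feature) pairs, mimicking the offset annotation in feature files."""
    page_offset, data = args
    return [(page_offset + m.start(), m.group().decode("ascii", "replace"))
            for m in EMAIL_RE.finditer(data)]

def scan_image(path):
    """Split the image into fixed-size pages and scan them in parallel,
    one page per worker process."""
    pages = []
    with open(path, "rb") as f:
        offset = 0
        while True:
            data = f.read(PAGE_SIZE)
            if not data:
                break
            pages.append((offset, data))
            offset += len(data)
    with ProcessPoolExecutor() as pool:
        results = pool.map(scan_page, pages)
    return [feature for page in results for feature in page]
```

Note one simplification: a feature that straddles a page boundary is missed here; a production scanner would overlap adjacent pages by a margin to catch such features.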
  
=AFF Background=

Another advantage of ignoring file systems is that bulk_extractor can be used to process any digital media. We have used the program to process hard drives, SSDs, optical media, camera cards, cell phones, network packet dumps, and other kinds of digital information.

AFF is an open and extensible file format for storing disk images and associated metadata. Using AFF, the user is not locked into a proprietary format that may limit how he or she may analyze the data. An open standard enables investigators to quickly and efficiently use their preferred tools to solve crimes, gather intelligence, and resolve security incidents.

Use of proprietary file formats means converting from one format to another to use multiple tools. Converting between formats risks data corruption if the formats are not well understood, and metadata may be lost if the formats do not all support the same forms of metadata.

==Output Feature Files==
==Extensible Design==

Use AFF to store any type of metadata, such as GPS coordinates, chain-of-custody information, or any other user-defined data.

bulk_extractor creates an output directory with the following layout:
;alerts.txt
:Processing errors.
;ccn.txt
:Credit card numbers.
;ccn_track2.txt
:Credit card “track 2” information, which has previously been found in some bank card fraud cases.
;domain.txt
:Internet domains found on the drive, including dotted-quad addresses found in text.
;email.txt
:Email addresses.
;ether.txt
:Ethernet MAC addresses found through IP packet carving of swap files, compressed system hibernation files, and file fragments.
;exif.txt
:EXIFs from JPEGs and video segments. This feature file contains all of the EXIF fields, expanded as XML records.
;find.txt
:The results of specific regular expression search requests.
;ip.txt
:IP addresses found through IP packet carving.
;rfc822.txt
:Email message headers, including Date:, Subject:, and Message-ID: fields.
;tcp.txt
:TCP flow information found through IP packet carving.
;telephone.txt
:US and international telephone numbers.
;url.txt
:URLs, typically found in browser caches, email messages, and pre-compiled into executables.
;url_searches.txt
:A histogram of terms used in Internet searches from services such as Google, Bing, and Yahoo.
;url_services.txt
:A histogram of the domain name portion of all the URLs found on the media.
;wordlist.txt
:A list of all “words” extracted from the disk, useful for password cracking.
;wordlist_*.txt
:The wordlist with duplicates removed, formatted so that it can be easily imported into a popular password-cracking program.
;zip.txt
:Information about every ZIP file component found on the media. This is exceptionally useful, as ZIP files contain internal structure and ZIP is increasingly the compound file format of choice for products such as Microsoft Office.
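Feature files are plain text, which is what makes them easy to inspect and post-process. Each record line is tab-delimited, with the byte offset first, then the feature, then (usually) the surrounding context; the exact column layout should be verified against your bulk_extractor version. A minimal reader, as an illustration:

```python
def read_features(path):
    """Parse a bulk_extractor-style feature file: one tab-separated
    record per line (offset, feature, optional context).
    Lines starting with '#' are comments/banner text and are skipped."""
    features = []
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue
            fields = line.split("\t")
            offset, feature = fields[0], fields[1]
            context = fields[2] if len(fields) > 2 else ""
            features.append((offset, feature, context))
    return features
```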
  
AFF supports the definition of arbitrary metadata by storing all data as name and value pairs, called segments. Some segments store the disk data and others store metadata. Because of this general design, any metadata can be defined by simply creating a new name and value pair. Each segment can be compressed to reduce the size of drive images, and cryptographic hashes can be calculated for each segment to ensure data integrity.
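The segment idea (named values, optional per-segment compression, a per-segment integrity hash) can be sketched as follows. This is an illustrative model only, not AFFLIB's actual on-disk encoding; the class and function names are assumptions made for the example.

```python
import hashlib
import zlib
from dataclasses import dataclass

@dataclass
class Segment:
    """Illustrative AFF-style segment: a named value with optional
    compression and a hash for integrity checking."""
    name: str
    stored: bytes
    compressed: bool
    md5: str

def make_segment(name, value, compress=True):
    # The hash is computed over the uncompressed value so integrity
    # can be verified after decompression.
    stored = zlib.compress(value) if compress else value
    return Segment(name, stored, compress, hashlib.md5(value).hexdigest())

def read_segment(seg):
    value = zlib.decompress(seg.stored) if seg.compressed else seg.stored
    assert hashlib.md5(value).hexdigest() == seg.md5, "integrity failure"
    return value

# Drive data and arbitrary user-defined metadata share one namespace:
image = {
    "page0": make_segment("page0", b"\x00" * 4096),           # drive data
    "case_number": make_segment("case_number", b"2013-042"),  # metadata
}
```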
For each of the above, two additional files may be created:

;*_stopped.txt
:bulk_extractor supports a stop list, a list of items that do not need to be brought to the user’s attention. Rather than simply suppressing this information, which might cause something critical to be hidden, stopped entries are stored in the stopped files.
;*_histogram.txt
:bulk_extractor can also create histograms of features. This is important, as experience has shown that email addresses, domain names, URLs, and other information that appear more frequently on a hard drive or in a cell phone’s memory can be used to rapidly create a pattern-of-life report.
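A feature histogram of the kind described above can be reproduced from any feature file with a few lines. This assumes the tab-separated record layout (offset, feature, context) and is an illustration of the concept, not bulk_extractor's own histogram code:

```python
from collections import Counter

def feature_histogram(lines):
    """Count how often each feature value occurs, most common first,
    mirroring the *_histogram.txt output described above."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return counts.most_common()
```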
  
==Flexible Design==

bulk_extractor also creates a file that captures the provenance of the run:

;report.xml
:A Digital Forensics XML report that includes information about the source media, how the bulk_extractor program was compiled and run, the time to process the digital evidence, and a meta-report of the information that was found.
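Because report.xml is ordinary XML, it can be examined with standard tools. The sketch below lists the report's top-level sections without assuming any specific DFXML element names, since those vary across bulk_extractor versions:

```python
import xml.etree.ElementTree as ET

def summarize_report(path):
    """Return the top-level section names of a DFXML report.

    Namespace prefixes like '{http://...}element' are stripped for
    readability; no particular element names are assumed."""
    root = ET.parse(path).getroot()
    return [child.tag.split("}")[-1] for child in root]
```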
  
For flexibility, there are three variations of AFF files – AFF, AFD and AFM – and freely available tools to easily convert between them.

==Post-Processing==

The original AFF format is a single file that contains segments with drive data and metadata. Its contents can be compressed, but it can be quite large, as the data on modern hard disks often reaches 100GB in size.

We have developed four programs for post-processing the bulk_extractor output:
;bulk_diff.py
:This program reports the differences between two bulk_extractor runs. The intent is to image a computer, run bulk_extractor on the disk image, let the computer run for a period of time, re-image the computer, run bulk_extractor on the second image, and then report the differences. This can be used to infer the user’s activities within a time period.
;cda_tool.py
:This tool, currently under development, reads multiple bulk_extractor reports from multiple runs against multiple drives and performs a multi-drive correlation using Garfinkel’s Cross Drive Analysis technique. This can be used to automatically identify new social networks or to identify new members of existing networks.
;identify_filenames.py
:In a bulk_extractor feature file, each feature is annotated with its byte offset from the beginning of the image. This program takes as input a bulk_extractor feature file and a DFXML file containing the locations of each file on the drive (produced with Garfinkel’s fiwalk program) and produces an annotated feature file that contains the offset, the feature, and the file in which the feature was found.
;make_context_stop_list.py
:Although forensic analysts frequently make “stop lists” (for example, a list of email addresses that appear in the operating system and should therefore be ignored), such lists have a significant problem. Because it is relatively easy to get an email address into the binary of an open source application, ignoring all of these email addresses may make it possible to cloak email addresses from forensic analysis. Our solution is to create context-sensitive stop lists, in which the feature to be stopped is presented with the context in which it occurs. The make_context_stop_list.py program takes the results of multiple bulk_extractor runs and creates a single context-sensitive stop list that can then be used to suppress features when they are found in a specific context. One such stop list, constructed from Windows and Linux operating systems, is available on the bulk_extractor website.
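The core idea behind bulk_diff.py (report features present in the second image but not the first) amounts to set arithmetic over parsed feature files. The sketch below illustrates the concept only; it is not the actual script, and it assumes the tab-separated feature-file layout:

```python
def new_features(before, after):
    """Return feature values present in `after` but not in `before`.

    Each argument is an iterable of feature-file lines
    (offset <TAB> feature <TAB> context). Offsets are ignored so that
    the same feature at a different location is not reported as new."""
    def values(lines):
        found = set()
        for line in lines:
            if line.strip() and not line.startswith("#"):
                fields = line.rstrip("\n").split("\t")
                if len(fields) >= 2:
                    found.add(fields[1])
        return found
    return sorted(values(after) - values(before))
```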
  
For ease of transfer, large AFF files can be broken into multiple AFD format files. The smaller AFD files can be readily moved around on a FAT32 file system, which limits files to 4GB, or stored on DVDs, which have similar size restrictions.

== Download ==

The current version of '''bulk_extractor''' is 1.3. It can be downloaded from http://digitalcorpora.org/downloads/bulk_extractor/
  
The AFM format stores the metadata in an AFF file and the disk data in a separate raw file. This format allows analysis tools that support the raw format to access the data without losing the metadata.

== Bibliography ==

=== Academic Publications ===

# Garfinkel, Simson, [http://simson.net/clips/academic/2013.COSE.bulk_extractor.pdf Digital media triage with bulk data analysis and bulk_extractor]. Computers and Security 32: 56-72 (2013)
# Beverly, Robert, Simson Garfinkel and Greg Cardwell, [http://simson.net/clips/academic/2011.DFRWS.ipcarving.pdf "Forensic Carving of Network Packets and Associated Data Structures"], DFRWS 2011, Aug. 1-3, 2011, New Orleans, LA. BEST PAPER AWARD (Acceptance rate: 23%, 14/62)
# Garfinkel, S., [http://simson.net/clips/academic/2006.DFRWS.pdf Forensic Feature Extraction and Cross-Drive Analysis], The 6th Annual Digital Forensic Research Workshop, Lafayette, Indiana, August 14-16, 2006. (Acceptance rate: 43%, 16/37)
  
==Compression and Encryption==

AFF supports two compression algorithms: zlib, which is fast and reasonably efficient, and LZMA, which is slower but dramatically more efficient. zlib is the same compression algorithm used by EnCase, so AFF files compressed with zlib are roughly the same size as the equivalent EnCase files. AFF files can be recompressed with the LZMA algorithm; the resulting files are anywhere from 1/2 to 1/10th the size of the original AFF/EnCase file.

===Tutorials===

# [http://simson.net/ref/2012/2012-08-08%20bulk_extractor%20Tutorial.pdf Using bulk_extractor for digital forensics triage and cross-drive analysis], DFRWS 2012
 
AFF 2.0 supports encryption of disk images. Unlike the password protection implemented by EnCase, encrypted AFF images cannot be accessed without the encryption key. FTK Imager and FTK added support for this encryption in version 3.0 and can create and access encrypted AFF images.

= AFF Tools =

* [[aimage]]
* [[ident]]
* [[afcat]]
* [[afcompare]]
* [[afconvert]]
* [[affix]]
* [[affuse]]
* [[afinfo]]
* [[afstats]]
* [[afxml]]
* [[afsegment]]

= See Also =

* [[AFF Developers Guide]] --- A guide for programmers on how to use AFF
* [[AFF Development Task List]] --- Want to help with AFF? Here is a list of things that need to be done.

== External Links ==

* [http://www.afflib.org/ Official website]
* [http://www.basistech.com/digital-forensics/aff.html Basis Technology's AFF website]
* [http://www.osforensics.com/tools/mount-disk-images.html OSFMount - 3rd party tool for mounting AFF disk images with a drive letter]

[[Category:Forensics File Formats]]
[[Category:Open Source Tools]]
Revision as of 11:29, 9 April 2013
