Carver 2.0 Planning Page

Revision as of 12:23, 1 November 2008

This page is for planning Carver 2.0.

Please do not delete text (ideas) here. Use something like this:

<s>bad idea</s>
:: good idea

This will look like:

bad idea

good idea

License

BSD-3.

Joachim library-based validators could require other licenses

OS

Linux/FreeBSD/MacOS

Shouldn't this just match what the underlying afflib & sleuthkit cover? RB
Yes, but you need to test and validate on each. Question: Do we want to support windows? Simsong 21:09, 30 October 2008 (UTC)
Joachim I think we would be wise to design with Windows support from the start; this will improve platform independence.
Agreed; I would even settle at first for being able to run against Cygwin. Note that I don't even own or use a copy of Windows, but the vast majority of forensic investigators do. RB 14:01, 31 October 2008 (UTC)

Name tooling

  • Joachim A name for the tooling: I propose coldcut
How about 'butcher'?  ;) RB 14:20, 31 October 2008 (UTC)
Joachim cleaver ( scalpel on steroids ;-) )
  • I would like to propose Gouge or Chisel :-) Rob J Meijer

Requirements

Joachim Could we do a MoSCoW evaluation of these?

  • AFF and EWF file images supported from scratch. (Joachim I would like to have raw/split raw and device access as well)
If we base our image i/o on afflib, we get all three with one interface. RB Instead of letting the tools use afflib, better to write an afflib module for carvfs, and update the libewf module. The tool could then be oblivious of the file format. Rob J Meijer
Joachim this layer should support multi-threaded decompression of compressed image types; this speeds up IO
  • Joachim volume/partition aware layer (what about carving unpartitioned space?)
  • File system aware layer. This could be, or make use of, tsk-cp.
    • By default, files are not carved. (clarify: only identified? RB; I guess that it operates like Selective file dumper .FUF 07:00, 29 October 2008 (UTC)). Alternatively, the tool could use libcarvpath and output carvpaths or create a directory with symlinks to carvpaths that point into a carvfs mountpoint Rob J Meijer.
  • Plug-in architecture for identification/validation.
    • Joachim support for multiple types of validators
      • dedicated validator
      • validator based on file library (i.e. we could specify/implement a file structure API for these)
      • configuration based validator (Can handle config files,like Revit07, to enter different file formats used by the carver.)
  • Ship with validators for:

Joachim I think we should distinguish between file format validators and content validators

    • JPEG
    • PNG
    • GIF
    • MSOLE
    • ZIP
    • TAR (gz/bz2)

Joachim For a production carver we need at least the following formats

    • Graphical Images
      • JPEG (the 3 different types with JFIF/EXIF support)
      • PNG
      • GIF
      • BMP
      • TIFF
    • Office documents
      • OLE2 (Word/Excel content support)
      • PDF
      • Open Office/Office 2007 (ZIP+XML)
Extension validation? AFAIK, MS Office 2007 DOCX format uses plain ZIP (or not?), and carved files will (or not?) have a .zip extension instead of DOCX. Is there any way to fix this (maybe using the file list in the zip)? .FUF 20:25, 31 October 2008 (UTC)
Joachim Addition: Office 2007 also has a binary file format which is also ZIP-ed data
    • Archive files
      • ZIP
      • 7z
      • gzip
      • bzip2
      • tar
      • RAR
    • E-mail files
      • PFF (PST/OST)
      • MBOX (text based format, base64 content support)
    • Audio/Video files
      • MPEG
      • MP2/MP3
      • AVI
      • ASF/WMV
      • QuickTime
      • MKV
    • Printer spool files
      • EMF (if I remember correctly)
    • Internet history files
      • index.dat
      • Firefox (SQLite 3)
    • Other files
      • thumbs.db
      • pagefile?
  • Simple fragment recovery carving using gap carving.
    • Joachim have a hook for more advanced fragment recovery?
  • Recovery of individual ZIP sections and JPEG icons that are not sector-aligned.
    • Joachim I would propose a generic fragment detection and recovery
  • Autonomous operation (some mode of operation should be completely non-interactive, requiring no human intervention to complete RB)
    • Joachim as much as possible, but allow to be overwritten by user
  • Tested on 500GB-sized images. Should be able to carve a 500GB image in roughly 50% more time than it takes to read the image.
    • Perhaps allocate a percentage budget per-validator (i.e. each validator adds N% to the carving time) RB
    • Joachim have multiple carving phases for precision/speed trade off?
  • Parallelizable (see the sketch after this list)
    • Joachim tunable for different architectures
  • Configuration:
    • Capability to parse some existing carvers' configuration files, either on-the-fly or as a one-way converter.
    • Disengage the internal configuration structure from configuration files; create parsers that present the expected structure
    • Joachim The validator should deal with the file structure; the carving algorithm should not know anything about the file structure (as in the revit07 design)
    • Either extend Scalpel/Foremost syntaxes for extended features or use a tertiary syntax (Joachim I would prefer a derivative of the revit07 configuration syntax, which has already encountered some of the problems of defining file structure in a configuration file)
  • Can output audit.txt file.
  • Joachim Can output a database with offset analysis values, e.g. for visualization tooling
  • Joachim Can output debug log for debugging the algorithm/validation
  • Easy integration into ascription software.
    • Joachim I'm no native speaker what do you mean with "ascription software"?
I think this was another non-native requesting easy scriptability. RB 14:20, 31 October 2008 (UTC)
Joachim that makes sense ;-)
  • Joachim When the tool outputs files, the filenames should contain the offset in the input data (in hexadecimal?)
  • Joachim Should the tool allow exporting embedded files?
  • Joachim Should the tool allow exporting fragments separately?
  • Mark I really like the fact that carved files are named after the physical or logical sector in which the file is found (photorec)
  • Mark I personally use photorec often for carving files in the whole volume (not only unallocated clusters), so I can store information about all potentially interesting files in MySQL
  • Mark It would also be nice if the files could be hashed immediately (MD5) so that looking them up in other tools (for example EnCase) is a snap
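
To make the 'Parallelizable' requirement and the per-validator time budget concrete, here is a minimal sketch (Python purely for illustration; the image object and the validator objects with their identify method and budget_used counter are hypothetical, not an agreed API) of input blocks being scanned by a worker pool:

# Minimal sketch of parallel block scanning; validator objects are hypothetical.
import concurrent.futures
import time

BLOCK_SIZE = 512 * 1024  # illustrative scan unit, not a design decision

def scan_block(offset, data, validators):
    # Run every validator against one block, accounting time spent per validator.
    hits = []
    for v in validators:
        start = time.monotonic()
        if v.identify(data):                       # hypothetical validator method
            hits.append((offset, v.name))
        v.budget_used += time.monotonic() - start  # feeds the per-validator N% budget idea
    return hits

def scan_image(image, validators, workers=4):
    # image is any file-like object opened on the (raw/AFF/EWF) input.
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = []
        offset = 0
        while True:
            data = image.read(BLOCK_SIZE)
            if not data:
                break
            futures.append(pool.submit(scan_block, offset, data, validators))
            offset += len(data)
        return [hit for f in futures for hit in f.result()]

In the real tool this would presumably be native threads in C/C++ (Python threads would not parallelize CPU-bound validators); the sketch only shows the shape of the work division and where a per-validator budget could hook in.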

Ideas

  • Use as much TSK as possible. Don't carry your own FS implementation the way photorec does.
Joachim using TSK as much as possible would not allow adding your own file system support (e.g. mobile phones, memory structures, cap files). I would propose wrapping TSK and using it as much as possible, but allowing integration of our own FS implementations.
  • Extracting/carving data from Thumbs.db? I've used foremost for it with some success. Vinetto has some critical bugs :( .FUF 19:18, 28 October 2008 (UTC)
Joachim this poses an interesting addition to the carver: do we want to support (let's call it for now) 'recursive in-file carving'? This is different from embedded files, because there is a file system structure in the file and not just another file structure

Rob J Meijer:

  • Use libcarvpath whenever possible and by default to avoid high storage requirements.
Joachim For easy deployment I would not opt for making an integral part of the tool solely dependent on a single external library; otherwise the library must be integrated in the package
  • Don't stop with filesystem detection after the first match. Often if a partition is reused with a new FS and is not all that full yet, much of the old FS can still be valid. I have seen this with ext2/fat. The fact that you have identified a valid FS on a partition doesn't mean there isn't an (almost) valid second FS that would yield additional files. Identifying doubly allocated space might in some cases also be relevant.
Joachim What you're saying is that dealing with file system fragments should be part of the carving algorithm
  • Allow use where filesystem based carving is done by another tool, and this tool is used as a second stage on (sets of) unallocated block (pseudo) files and/or non-FS partition (pseudo) files.
Joachim I would not opt for this. The tool would be dependent on other tools and their data format, which makes the tool difficult to maintain. I would opt to integrate the functionality of having multiple recovery phases (stages) and allow the tooling to run the phases after one another or separately.
  • Ability to be used as a library instead of a tool. Ability to access metadata through the library, and thus the ability to set metadata from the carving modules. This would be extremely useful for integrating the project into a framework like OCFA.
Joachim I guess most of the code could be integrated into libraries, but I would not opt for integrating tool functionality into a library
  • Mark I think it would be very handy to have a CSV, TSV, XML or other delimited output (log)file with information about carved files. This output file can then be stored in a database or Excel sheet (report function)
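
To illustrate the delimited output, hex-offset file naming and immediate MD5 hashing suggested above, a minimal sketch (Python; all field names and values are invented for the example, not a proposed spec):

# Sketch: one audit row per carved file, named after its input offset in hex.
import csv
import hashlib

def write_audit_row(writer, offset, data, file_type):
    name = "%016x.%s" % (offset, file_type)    # filename carries the offset in hex
    writer.writerow({
        "offset": "0x%x" % offset,
        "size": len(data),
        "type": file_type,
        "md5": hashlib.md5(data).hexdigest(),  # hash at carve time, as suggested above
        "filename": name,
    })

with open("audit.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["offset", "size", "type", "md5", "filename"])
    writer.writeheader()
    write_audit_row(writer, 0x4a3f00, b"\xff\xd8\xff\xe0", "jpg")

Such a file imports directly into MySQL or Excel, which would cover the report function mentioned above.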

Configuration language/specification

  • Carving data structures. For example, extract all TCP headers from image by defining TCP header structure and some fields (e.g. source port > 1024, dest port = 80). This will extract all data matching the pattern and write a file with other fields. Another example is carving INFO2 structures and URL activity records from index.dat .FUF 20:51, 28 October 2008 (UTC)
    • This has the opportunity to be extended to the concept of "point at blob FOO and interpret it as BAR"

.FUF added: The main idea is to allow users to define structures, for example (in pascal-like form):

Field1: Byte = 123;
SomeTextLength: DWORD;
SomeText: string[SomeTextLength];
Field4: Char = 'r';
...

This will produce something like this:

Field1 = 123
SomeTextLength = 5
SomeText = 'abcd1'
Field4 = 'r'

(In text or raw forms.)

Opinions?

Opinion: Simple pattern identification like that may not suffice; I think Simson's original intent was not only to identify but to allow for validation routines (plugins, as the original wording was). As such, the format syntax would need to implement a large chunk of some programming language in order to be sufficiently flexible. RB

Joachim In my opinion your example is too limited. Making the revit configuration I learned that you need a near-programming language to specify some file formats; a simple descriptive language is too limiting. I would also go for 2 bytes with endianness instead of using terminology like WORD and small integer; it's much clearer. The configuration also needs to deal with aspects like cardinality, and required and optional structures. (See the sketch at the end of this section.)

This is simply data structures carving, see ideas above. Somebody (I cannot track so many changes per day) separated the original text. There is no need to count and join different structures. .FUF 19:53, 31 October 2008 (UTC)
Joachim This was probably me; is the text back in its original form?
I started it by moving your Revit07 comment to the validator/plugin section in this edit (http://www.forensicswiki.org/index.php?title=Carver_2.0_Planning_Page&diff=prev&oldid=7583), since I was still at that point thinking operational configuration for that section, not parser configurations.

Please take a look at the revit07 configuration. It's not there yet, but it goes a long way. Some things currently missing:

  • bitwise alignment
  • handling encapsulated streams (MPEG/capture files)
  • handling content based formats (MBOX)
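
To make this discussion concrete, here is a minimal sketch (Python; the definition format is invented for this example and is not the revit07 syntax) of a declarative structure with explicit field sizes and endianness plus per-field constraints, reusing .FUF's TCP header idea from above. Cardinality and required/optional structures would, as noted, still need more than this:

# Declarative structure sketch: explicit sizes/endianness plus field constraints.
import struct

TCP_HEADER = [
    # (name, struct format, constraint or None); ">H" = 2 bytes, big endian
    ("src_port", ">H", lambda v: v > 1024),
    ("dst_port", ">H", lambda v: v == 80),
    ("seq",      ">I", None),
    ("ack",      ">I", None),
]

def match(buf, offset, definition):
    # Return the decoded fields if all constraints hold, else None.
    fields = {}
    for name, fmt, constraint in definition:
        if offset + struct.calcsize(fmt) > len(buf):
            return None                        # structure runs past the buffer
        (value,) = struct.unpack_from(fmt, buf, offset)
        if constraint is not None and not constraint(value):
            return None
        fields[name] = value
        offset += struct.calcsize(fmt)
    return fields

For example, match(block, 0, TCP_HEADER) returns the decoded field dict, or None if any constraint fails.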

Carving algorithm

Joachim

  • should we allow for multiple carving phases (runs/stages)?
I opt yes (separation of concerns)
  • should we allow for multiple carving algorithms?
I opt yes; this allows testing of different approaches
  • Should the algorithm try to do as much as possible in one run over the input data, to reduce IO?
I opt that the tool should allow for both single and multiple runs over the input data, to minimize either IO or CPU as the bottleneck
  • Interaction between algorithm and validators (see the sketch after this list)
    • does the algorithm pass data blocks to the validators?
    • does a validator need to maintain a state?
    • does a validator need to revert a state?
    • How do we deal with embedded files and content validation? Do the validators call another validator?
  • do we use the assumption that a data block can be used by a single file (with the exception of embedded/encapsulated files)?
  • Revit07 allows for multiple concurrent result file states to deal with fragmentation. One has the attribute of being active (the preferred) and the others passive. Do we want/need something similar? The algorithm adds blocks of input data (offsets) to these result file states.
    • if so, what info would these result file states require (type, list of input data blocks)?
  • how do we deal with file system remainders?
    • Can we abstract them and compare them against available file system information?
  • Do we carve file systems in files?
I opt that at least the validator uses this information
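
A minimal sketch (Python; the method names and the exception convention are hypothetical) of the algorithm/validator interaction questions above, with a validator that maintains state and can revert it when the algorithm withdraws a block:

# Sketch of algorithm/validator interaction; names and conventions are invented.
class Validator:
    def feed(self, block):
        # Consume the next data block; update internal parser state.
        raise NotImplementedError
    def snapshot(self):
        # Return an opaque copy of the current state.
        raise NotImplementedError
    def restore(self, state):
        # Revert to a snapshotted state, e.g. after a wrong gap guess.
        raise NotImplementedError

def carve_phase(blocks, validator):
    # One possible shape of a phase: offer each block, revert on validation failure.
    kept = []
    for offset, block in blocks:
        before = validator.snapshot()
        try:
            validator.feed(block)
            kept.append(offset)
        except ValueError:                     # stand-in for 'validation failed'
            validator.restore(before)
    return kept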

Carving scenarios

Joachim

  • normal file (file structure, loose text based structure (more a content structure?))
  • fragmented file (the file exists in its entirety)
  • a file fragment (the file does not exist in its entirety)
  • intertwined file
  • encapsulated file (MPEG/network capture)
  • embedded file (JPEG thumbnail)
  • obfuscation ('encrypted' PFF); this also entails encryption and/or compression
  • file system in file

File System Awareness

Background: Why be File System Aware?

Advantages of being FS aware:

  • You can pick up sector allocation sizes
Joachim do you mean file system block sizes?
  • Some file systems may store things off sector boundaries. (ReiserFS with tail packing)
  • Increasingly file systems have compression (NTFS compression)
  • Carve just the sectors that are not in allocated files.
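
A minimal sketch (Python; where the allocation information comes from is left open, in practice presumably TSK) of restricting carving to unallocated space:

# Sketch: yield (offset, length) byte runs of blocks not claimed by allocated files.
def unallocated_runs(allocated, total_blocks, block_size):
    # 'allocated' is a set of file-system block numbers in use by allocated files.
    run_start = None
    for block in range(total_blocks):
        if block not in allocated:
            if run_start is None:
                run_start = block
        elif run_start is not None:
            yield (run_start * block_size, (block - run_start) * block_size)
            run_start = None
    if run_start is not None:
        yield (run_start * block_size, (total_blocks - run_start) * block_size)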

Tasks that would be required

Discussion

As noted above, TSK should be utilized as much as possible, particularly the filesystem-aware portion. If we want to identify filesystems outside of its supported set, it would be more worth our time to work on implementing them there than in the carver itself. RB
I guess this tool operates like Selective file dumper and can recover files in both ways (or not?). Recovering files by using carving can recover files in situations where sleuthkit does nothing (e.g. file on NTFS was deleted using ntfs-3g, or filesystem was destroyed or just unknown). And we should build the list of filesystems supported by carver, not by TSK. .FUF 07:08, 29 October 2008 (UTC)
This tool is still in the early planning stages (requirements discovery), hence few operational details (like precise modes of operation) have been fleshed out - those will and should come later. The justification for strictly using TSK for the filesystem-sensitive approach is simple: TSK has good filesystem APIs, and it would be foolish to create yet another standalone, incompatible implementation of filesystem(foo) when time would be better spent improving those in TSK, aiding other methods of analysis as well. This is the same reason individuals that have implemented several other carvers are participating: de-duplication of effort. RB

Joachim I would like to have the carver (recovery tool) also do recovery using file allocation data or remainders of file allocation data.

Joachim I would go as far as to ask you all to look beyond the carver as a tool, and look from the perspective of the carver as part of the forensic investigation process. In my eyes certain information needed/acquired by the carver could also be very useful investigative information, e.g. what part of a hard disk contains empty sectors.

Supportive tooling

Joachim

  • validator (definitions) tester (detest in revit07)
  • tool to make configuration-based definitions
  • post-carving validation
  • the carver needs to provide support for FUSE mounting of carved files (carvfs)

Testing

Joachim

  • automated testing
  • test data
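
In the spirit of revit07's detest, automated testing could feed known-good and deliberately damaged samples to each validator and check the verdicts. A minimal sketch (Python; validate_jpeg is a stand-in, a real test would import the plug-in under test):

# Sketch of automated validator testing against known-good and damaged samples.
def validate_jpeg(data):
    # Stand-in validator: SOI marker at the start, EOI marker at the end.
    return data.startswith(b"\xff\xd8\xff") and data.endswith(b"\xff\xd9")

def test_validator():
    good = b"\xff\xd8\xff\xe0" + b"\x00" * 16 + b"\xff\xd9"
    truncated = good[:-2]                      # simulates a fragment
    assert validate_jpeg(good)
    assert not validate_jpeg(truncated)

test_validator()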

Validator Construction

Options:

  • Write validators in C/C++
    • Joachim you mean dedicated validators
  • Have a scripting language for writing them (Python? Perl?) Our own?
    • Joachim use easy-to-embed programming languages, e.g. Python or Lua
  • Use existing programs (libjpeg?) as plug-in validators?
    • Joachim define a file structure API for this (see the sketch below)
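
A minimal sketch (Python; only the standard zipfile and io modules are real, the validator class shape is a hypothetical plug-in API) of such a validator. It would also answer the Office 2007 extension question raised under Requirements: inspect the ZIP file list for the OOXML marker to tell .docx apart from plain .zip:

# Sketch of a content-aware ZIP validator; the plug-in API shape is invented.
import io
import zipfile

class ZipValidator:
    name = "zip"

    def identify(self, data):
        return data.startswith(b"PK\x03\x04")  # ZIP local file header magic

    def validate(self, data):
        # Returns a type name on success, None on failure.
        try:
            names = zipfile.ZipFile(io.BytesIO(data)).namelist()
        except zipfile.BadZipFile:
            return None
        if "[Content_Types].xml" in names:     # OOXML marker, Office 2007 family
            return "docx"
        return "zip"

Telling docx from xlsx/pptx would additionally require looking for word/, xl/ or ppt/ prefixes in the name list.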

Existing Code that we have

Joachim Please add any missing links

Documentation/Articles

  • DFRWS2006/2007 carving challenge results
  • DFRWS2008 paper on carving

Carvers

  • DFRWS2006/2007 carving challenge results
  • photorec (http://www.cgsecurity.org/wiki/PhotoRec)
  • revit06 and revit07 (http://sourceforge.net/projects/revit/)
  • s3/scarve

Possible file structure validator libraries

  • diverse existing file support libraries
  • libole2 (in-house experimental code for OLE2 support)
  • libpff (alpha release for PFF (PST/OST) file support) (http://sourceforge.net/projects/libpff/)

Input support

  • AFF (http://www.afflib.org/)
  • EWF (http://sourceforge.net/projects/libewf/)
  • TSK device & raw & split raw (http://www.sleuthkit.org/)

Volume/Partition support

  • disktype (http://disktype.sourceforge.net/)
  • testdisk (http://www.cgsecurity.org/wiki/TestDisk)
  • TSK

File system support

  • TSK
  • photorec FS code
  • implementations of FS in Linux/BSD

Content support

Zero storage support

  • libcarvpath
  • carvfs
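
For reference, and from memory (please correct if wrong): libcarvpath designates a carved file as a list of offset+size fragments relative to the parent image, e.g.

512+1024_1048576+2048

i.e. two fragments of 1024 and 2048 bytes, so carved results occupy no storage of their own.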

Implementation Timeline

  1. gather the available resources/ideas/wishes/needs etc. (I guess we're in this phase)
  2. start discussing a high level design (in terms of algorithm, facilities, information needed)
    1. input formats facility
    2. partition/volume facility
    3. file system facility
    4. file format facility
    5. content facility
    6. how to deal with fragment detection (do the validators allow for fragment detection?)
    7. how to deal with recombination of fragments
    8. do we want multiple carving phases in light of speed/precision tradeoffs
  3. start detailing parts of the design
    1. Discuss options for a grammar driven validator?
    2. Hard-coded plug-ins?
    3. Which existing code can we use?
  4. start building/assembling parts of the tooling for a prototype
    1. Implement simple file carving with validation.
    2. Implement gap carving
  5. Initial Release
  6. Implement the threaded carving that .FUF is describing above.

Joachim Shouldn't multi-threaded carving (MTC) be part of the 1st version? The MT approach makes for different design decisions