= Internet Explorer History File Format =
{{Expand}}
[[Internet Explorer]] versions 4 through 9 store the web browsing history in files named <tt>index.dat</tt>. The files contain multiple records.
MSIE version 3 probably uses similar records in its History (Cache) files.
  
== File Locations ==
  
Internet Explorer history files keep a record of URLs that the browser has visited, cookies that were created by these sites, and any temporary internet files that were downloaded during the visit. As a result, Internet Explorer history files are kept in several locations. Regardless of the information stored in the file, the file is named index.dat.
 
 
On Windows 95/98 these files were located in the following locations:
 
 
<pre>
%systemdir%\Temporary Internet Files\Content.ie5
%systemdir%\Cookies
%systemdir%\History\History.ie5
</pre>
  
On Windows 2000/XP the file locations have changed:
<pre>
%systemdir%\Documents and Settings\%username%\Local Settings\Temporary Internet Files\Content.ie5
%systemdir%\Documents and Settings\%username%\Cookies
%systemdir%\Documents and Settings\%username%\Local Settings\History\history.ie5
</pre>
  
Internet Explorer also keeps daily, weekly, and monthly history logs that will be located in subfolders of %systemdir%\Documents and Settings\%username%\Local Settings\History\history.ie5.  The folders will be named <tt>MSHist<two-digit number><starting four-digit year><starting two-digit month><starting two-digit day><ending four-digit year><ending two-digit month><ending two-digit day></tt>.  For example, the folder containing data from March 26, 2008 to March 27, 2008 might be named <tt>MSHist012008032620080327</tt>.
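As a concrete illustration, the following minimal C sketch parses such a folder name into its two dates (the helper name is hypothetical):

<pre>
#include <stdio.h>

/* Sketch: split an MSHist folder name such as "MSHist012008032620080327"
 * into its starting and ending dates, following the naming scheme
 * described above. Illustrative only. */
static int parse_mshist_name(const char *name,
                             int *sy, int *sm, int *sd,
                             int *ey, int *em, int *ed)
{
    int prefix;
    if (sscanf(name, "MSHist%2d%4d%2d%2d%4d%2d%2d",
               &prefix, sy, sm, sd, ey, em, ed) != 7)
        return -1;
    return 0;
}

int main(void)
{
    int sy, sm, sd, ey, em, ed;
    if (parse_mshist_name("MSHist012008032620080327",
                          &sy, &sm, &sd, &ey, &em, &ed) == 0)
        printf("%04d-%02d-%02d to %04d-%02d-%02d\n",
               sy, sm, sd, ey, em, ed);  /* 2008-03-26 to 2008-03-27 */
    return 0;
}
</pre>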
  
Note that not every file named index.dat is an MSIE History (Cache) file.
  
== File Header ==
Every version of Internet Explorer since Internet Explorer 5 has used the same structure for the file header and the individual records.  Internet Explorer history files begin with:
<pre>
43 6c 69 65 6e 74 20 55 72 6c 43 61 63 68 65 20 4d 4d 46 20 56 65 72 20 35 2e 32
</pre>
which represents the ASCII string "Client UrlCache MMF Ver 5.2".
  
The next field in the file header starts at byte offset 28 and is a four-byte representation of the file size. The number is stored in [[endianness | little-endian]] format, so the bytes must be reversed to calculate the value; for example, the bytes <tt>00 00 08 00</tt> represent the value 0x00080000 (524288).
  
Also of interest in the file header is the location of the cache directories. In the URL records the cache directories are given as a number, with one representing the first cache directory, two representing the second, and so on. The names of the cache directories are kept at byte offset 64 in the file. Each directory entry is 12 bytes long, of which the first eight bytes contain the directory name.
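Putting these header fields together, here is a minimal C sketch of reading them. Stopping at a directory entry whose name begins with a zero byte is an assumption made for illustration, not part of the format description above.

<pre>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Sketch: validate an index.dat header and print the fields described
 * above (signature, file size at offset 28, cache directories at 64). */
int parse_header(const uint8_t *data, size_t size)
{
    static const char signature[] = "Client UrlCache MMF Ver 5.2";

    if (size < 76 || memcmp(data, signature, strlen(signature)) != 0)
        return -1;

    /* Four-byte little-endian file size at offset 28. */
    uint32_t file_size = (uint32_t)data[28] | (uint32_t)data[29] << 8
                       | (uint32_t)data[30] << 16 | (uint32_t)data[31] << 24;
    printf("file size: %u bytes\n", file_size);

    /* 12-byte directory entries at offset 64; the first eight bytes of
     * each entry hold the directory name. */
    for (size_t off = 64; off + 12 <= size && data[off] != 0; off += 12) {
        char name[9] = {0};
        memcpy(name, data + off, 8);
        printf("cache directory: %s\n", name);
    }
    return 0;
}
</pre>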
  
== Allocation bitmap ==
The IE History File contains an allocation bitmap running from offset 0x250 to 0x4000.
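The exact bit-to-block mapping is not described above; purely as an illustration, the sketch below assumes one bit per 128-byte block, with the first block at offset 0x4000 and the least significant bit first.

<pre>
#include <stdint.h>

/* Sketch: test whether a 128-byte block is marked allocated in the
 * bitmap at offset 0x250. The bit-to-block mapping used here is an
 * assumption, not taken from the description above. */
static int block_is_allocated(const uint8_t *file_data, uint32_t block_number)
{
    uint32_t byte_index = 0x250 + block_number / 8;
    if (byte_index >= 0x4000)
        return 0; /* beyond the end of the bitmap */
    return (file_data[byte_index] >> (block_number % 8)) & 1;
}
</pre>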
  
== Record Formats ==
  
Every record has a similar header that consists of 8 bytes.
  
<pre>typedef struct _RECORD_HEADER {
  /* 000 */ char        Signature[4];
  /* 004 */ uint32_t    NumberOfBlocksInRecord;
} RECORD_HEADER;</pre>
  
The size of the record can be determined from the number of blocks in the record; by default the block size is 128 bytes. Therefore, a length of <pre>05 00 00 00</pre> would indicate five blocks (because the number is stored in little-endian format) of 128 bytes, for a total record length of 640 bytes. Note that even for allocated records the number of blocks value cannot be fully relied upon.
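Combining the record header with the default block size allows the records in a file to be walked. In the sketch below, starting at offset 0x4000 (where the allocation bitmap ends) is an assumption; the four signatures checked are the record types listed later in this article.

<pre>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 128u

/* Sketch: walk the records in an index.dat buffer, printing each
 * record's signature and size. Unknown data is skipped one block at
 * a time. */
void walk_records(const uint8_t *data, size_t size)
{
    size_t offset = 0x4000;

    while (offset + 8 <= size) {
        if (memcmp(data + offset, "URL ", 4) == 0 ||
            memcmp(data + offset, "REDR", 4) == 0 ||
            memcmp(data + offset, "HASH", 4) == 0 ||
            memcmp(data + offset, "LEAK", 4) == 0) {
            uint32_t blocks = (uint32_t)data[offset + 4]
                            | (uint32_t)data[offset + 5] << 8
                            | (uint32_t)data[offset + 6] << 16
                            | (uint32_t)data[offset + 7] << 24;
            printf("%.4s record at 0x%zx, %u blocks\n",
                   (const char *)(data + offset), offset, blocks);
            /* Sanity-check the count; it cannot be fully relied upon. */
            if (blocks == 0 || offset + (size_t)blocks * BLOCK_SIZE > size)
                blocks = 1;
            offset += (size_t)blocks * BLOCK_SIZE;
        } else {
            offset += BLOCK_SIZE; /* no known signature; skip one block */
        }
    }
}
</pre>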
  
The blocks that make up a record can have slack space.  
  
Currently 4 types of records are known:
* URL
* REDR
* HASH
* LEAK
  
Note that the location and filename strings are stored in the local codepage; normally these strings use only the ASCII character set. Chinese versions of Windows are known to use extended characters as well.
  
=== URL Records ===
  
These records indicate URIs that were actually requested. They contain the location and additional data like the web server's HTTP response. They begin with the header, in hexadecimal:
  
<pre>55 52 4C 20</pre>
This corresponds to the string <tt>URL</tt> followed by a space.
  
The definition for the structure in C99 format:
  
<pre>typedef struct _URL_RECORD_HEADER {
  /* 000 */ char        Signature[4];
  /* 004 */ uint32_t    AmountOfBlocksInRecord;
  /* 008 */ FILETIME    LastModified;
  /* 010 */ FILETIME    LastAccessed;
  /* 018 */ FATTIME     Expires;
  /* 01c */
  // Not finished yet
} URL_RECORD_HEADER;</pre>
  
<pre>
typedef struct _FILETIME {
  /* 000 */ uint32_t    lower;
  /* 004 */ uint32_t    upper;
} FILETIME;</pre>
  
<pre>
typedef struct _FATTIME {
  /* 000 */ uint16_t    date;
  /* 002 */ uint16_t    time;
} FATTIME;</pre>
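Both timestamp layouts are standard Windows conventions: FILETIME counts 100-nanosecond intervals since January 1, 1601 (UTC), and FATTIME packs a DOS date and time into two 16-bit values. A conversion sketch:

<pre>
#include <stdint.h>
#include <time.h>

/* 11644473600 is the number of seconds between 1601-01-01 and 1970-01-01. */
time_t filetime_to_unix(uint32_t lower, uint32_t upper)
{
    uint64_t intervals = ((uint64_t)upper << 32) | lower;
    return (time_t)(intervals / 10000000ULL - 11644473600ULL);
}

/* DOS packed date/time:
 * date: bits 0-4 day, bits 5-8 month, bits 9-15 years since 1980
 * time: bits 0-4 seconds/2, bits 5-10 minutes, bits 11-15 hours */
struct tm fattime_to_tm(uint16_t date, uint16_t time)
{
    struct tm tm = {0};
    tm.tm_mday = date & 0x1f;
    tm.tm_mon  = ((date >> 5) & 0x0f) - 1;   /* tm_mon is 0-based */
    tm.tm_year = ((date >> 9) & 0x7f) + 80;  /* tm_year is years since 1900 */
    tm.tm_sec  = (time & 0x1f) * 2;
    tm.tm_min  = (time >> 5) & 0x3f;
    tm.tm_hour = (time >> 11) & 0x1f;
    return tm;
}
</pre>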
  
The actual interpretation of the "LastModified" and "LastAccessed" fields depends on the type of history file in which the record is contained. As a matter of fact, Internet Explorer uses three different types of history files, namely Daily History, Weekly History, and Main History. Other "index.dat" files are used to store cached copies of visited pages and cookies.
The information concerning how to interpret the dates of these different files can be found on Capt. Steve Bunting's web page at the University of Delaware Computer Forensics Lab (http://www.stevebunting.org/udpd4n6/forensics/index_dat2.htm).
Please be aware that most free and/or open source index.dat parsing programs, as well as quite a few commercial forensic tools, are not able to correctly interpret the above dates. More specifically, they interpret all the times and dates as if the records were contained in a Daily History file, regardless of the actual type of the file they are stored in.
  
=== REDR Records ===
REDR records are very simple records. They simply indicate that the browser was redirected to another site. REDR records always start with the string REDR (0x52 45 44 52). The next four bytes are the size of the record in little-endian format; the size indicates the number of 128-byte blocks.
  
At offset 8 from the start of the REDR record is an unknown data field.  It has been confirmed that this is not a date field.
  
16 bytes into the REDR record is the URL that was visited, stored as a null-terminated string. After the URL, the REDR record appears to be padded with zeros until the end of the 128-byte block.
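A minimal C sketch of interpreting a single REDR record as described above (the function name is illustrative):

<pre>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Sketch: read one REDR record. Returns the record size in bytes,
 * or 0 if the data does not look like a valid REDR record. */
size_t read_redr_record(const uint8_t *record, size_t available)
{
    if (available < 128 || memcmp(record, "REDR", 4) != 0)
        return 0;

    /* Little-endian size, in 128-byte blocks, at offset 4. */
    uint32_t blocks = (uint32_t)record[4] | (uint32_t)record[5] << 8
                    | (uint32_t)record[6] << 16 | (uint32_t)record[7] << 24;
    size_t record_size = (size_t)blocks * 128;
    if (record_size < 128 || record_size > available)
        return 0;

    /* Offset 8 holds an unknown (non-date) field; the URL starts at
     * offset 16 as a null-terminated string. */
    printf("redirected to: %.*s\n",
           (int)(record_size - 16), (const char *)record + 16);
    return record_size;
}
</pre>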
  
=== HASH Records ===
  
=== LEAK Records ===
The exact purpose of LEAK records remains unknown; however, research performed by Mike Murr suggests that LEAK records are created when the machine attempts to delete records from the history file while a corresponding Temporary Internet File (TIF) is held open and cannot be deleted.
  
== See Also ==
  
* [[Internet Explorer]]
  
== External Links ==
  
* [http://www.cqure.net/wp/iehist/ IEHist program for reading index.dat files]
* [http://www.milincorporated.com/a3_index.dat.html What is in Index.dat files]
* [http://www.foundstone.com/us/pdf/wp_index_dat.pdf Detailed analysis of index.dat file format]
* [http://downloads.sourceforge.net/sourceforge/libmsiecf/MSIE_Cache_File_format.pdf MSIE Cache File (index.dat) format specification]
* [http://www.forensicblog.org/2009/09/10/the-meaning-of-leak-records/ The Meaning of LEAK records]
* [http://www.tzworks.net/prototype_page.php?proto_id=6 Windows 'index.dat' Parser] Free tool that can be run on Windows, Linux or Mac OS-X.
  
[[Category:File Formats]]

Carver 2.0 Planning Page
Revision as of 13:48, 31 October 2008

This page is for planning Carver 2.0.

Please do not delete text (ideas) here. Use something like this:

<s>bad idea</s>
:: good idea

This will look like:

bad idea

good idea


License

BSD-3.

Joachim a library-based validator could need other licenses

OS

Linux/FreeBSD/MacOS

Shouldn't this just match what the underlying afflib & sleuthkit cover? RB
Yes, but you need to test and validate on each. Question: Do we want to support Windows? Simsong 21:09, 30 October 2008 (UTC)
Joachim I think we would be wise to design with Windows support from the start; this will improve platform independence.
Agreed; I would even settle at first for being able to run against Cygwin. Note that I don't even own or use a copy of Windows, but the vast majority of forensic investigators do. RB 14:01, 31 October 2008 (UTC)

Requirements

  • Joachim A name for the tooling: I propose coldcut
How about 'butcher'?  ;) RB 14:20, 31 October 2008 (UTC)

Joachim Could we do a MoSCoW evaluation of these?

  • AFF and EWF file images supported from scratch. (Joachim I would like to have raw/split raw and device access as well)
If we base our image i/o on afflib, we get all three with one interface. RB
  • Joachim volume/partition aware layer (what about carving unpartitioned space?)
  • File system aware layer.
    • By default, files are not carved. (clarify: only identified? RB; I guess that it operates like Selective file dumper .FUF 07:00, 29 October 2008 (UTC))
  • Plug-in architecture for identification/validation.
    • Joachim support for multiple types of validators
      • dedicated validator
      • validator based on file library (i.e. we could specify/implement a file structure for these)
      • configuration based validator (Can handle config files, like Revit07, to enter different file formats used by the carver.)
  • Ship with validators for:

Joachim I think we should distinguish between file format validators and content validators

    • JPEG
    • PNG
    • GIF
    • MSOLE
    • ZIP
    • TAR (gz/bz2)

Joachim For a production carver we need at least the following formats

    • Graphical images
      • JPEG (the 3 different types with JFIF/EXIF support)
      • PNG
      • GIF
      • BMP
      • TIFF
    • Office documents
      • OLE2 (Word/Excel content support)
      • PDF
      • Open Office/Office 2007 (ZIP+XML)
    • Archive files
      • ZIP
      • 7z
      • gzip
      • bzip2
      • tar
      • RAR
    • E-mail files
      • PFF (PST/OST)
      • MBOX (text based format, base64 content support)
    • Audio/Video files
      • MPEG
      • MP2/MP3
      • AVI
      • ASF/WMV
      • QuickTime
      • MKV
    • Printer spool files
      • EMF (if I remember correctly)
    • Internet history files
      • index.dat
      • Firefox (SQLite 3)
    • Other files
      • thumbs.db
      • pagefile?
  • Simple fragment recovery carving using gap carving.
    • Joachim have hook in for more advanced fragment recovery?
  • Recovery of individual ZIP sections and JPEG icons that are not sector aligned.
    • Joachim I would propose a generic fragment detection and recovery
  • Autonomous operation (some mode of operation should be completely non-interactive, requiring no human intervention to complete RB)
    • Joachim as much as possible, but allow it to be overridden by the user
  • Tested on 500GB-sized images. Should be able to carve a 500GB image in roughly 50% longer than it takes to read the image.
    • Perhaps allocate a percentage budget per-validator (i.e. each validator adds N% to the carving time) RB
    • Joachim have multiple carving phases for precision/speed trade off?
  • Parallelizable
    • Joachim tunable for different architectures
  • Configuration:
    • Capability to parse some existing carvers' configuration files, either on-the-fly or as a one-way converter.
    • Disengage internal configuration structure from configuration files, create parsers that present the expected structure
    • Joachim The validator should deal with the file structure; the carving algorithm should not know anything about the file structure (as in the revit07 design)
    • Either extend Scalpel/Foremost syntaxes for extended features or use a tertiary syntax (Joachim I would prefer a derivative of the revit07 configuration syntax which already has encountered some problems of dealing with defining file structure in a configuration file)
  • Can output audit.txt file.
  • Joachim Can output database with offset analysis values i.e. for visualization tooling
  • Joachim Can output debug log for debugging the algorithm/validation
  • Easy integration into ascription software.
    • Joachim I'm not a native speaker; what do you mean by "ascription software"?
I think this was another non-native requesting easy scriptability. RB 14:20, 31 October 2008 (UTC)

Ideas

  • Use TSK as much as possible. Don't carry your own FS implementation the way photorec does.
    • Joachim using TSK as much as possible would not allow to add your own file system support (i.e. mobile phones, memory structures, cap files)

I would propose wrapping TSK and using it as much as possible but allow to integrate own FS implementations.

  • Extracting/carving data from Thumbs.db? I've used foremost for it with some success. Vinetto has some critical bugs :( .FUF 19:18, 28 October 2008 (UTC)
  • Carving data structures. For example, extract all TCP headers from image by defining TCP header structure and some fields (e.g. source port > 1024, dest port = 80). This will extract all data matching the pattern and write a file with other fields. Another example is carving INFO2 structures and URL activity records from index.dat .FUF 20:51, 28 October 2008 (UTC)
    • This has the opportunity to be extended to the concept of "point at blob FOO and interpret it as BAR"

.FUF added: The main idea is to allow users to define structures, for example (in Pascal-like form):

Field1: Byte = 123;
SomeTextLength: DWORD;
SomeText: string[SomeTextLength];
Field4: Char = 'r';
...

This will produce something like this:

Field1 = 123
SomeTextLength = 5
SomeText = 'abcd1'
Field4 = 'r'

(In text or raw forms.)

Opinions?

Opinion: Simple pattern identification like that may not suffice, I think Simson's original intent was not only to identify but to allow for validation routines (plugins, as the original wording was). As such, the format syntax would need to implement a large chunk of some programming language in order to be sufficiently flexible. RB
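As a rough sketch of the simple pattern identification discussed above, a table-driven matcher for such definitions might look as follows. All names and the two field kinds are hypothetical, and as RB notes, real validation would need far more expressive power.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical field kinds for a user-defined structure definition. */
enum field_kind { FIXED_BYTE, LENGTH_DWORD, STRING_BY_LENGTH, FIXED_CHAR };

struct field {
    enum field_kind kind;
    uint8_t expected;   /* only used by FIXED_BYTE / FIXED_CHAR */
};

/* Return the number of bytes matched at `data`, or 0 on mismatch. */
size_t match_structure(const struct field *fields, size_t count,
                       const uint8_t *data, size_t size)
{
    size_t pos = 0;
    uint32_t pending_length = 0;

    for (size_t i = 0; i < count; i++) {
        switch (fields[i].kind) {
        case FIXED_BYTE:
        case FIXED_CHAR:
            if (pos >= size || data[pos] != fields[i].expected)
                return 0;
            pos++;
            break;
        case LENGTH_DWORD:   /* little-endian 32-bit length field */
            if (pos + 4 > size)
                return 0;
            pending_length = (uint32_t)data[pos] | (uint32_t)data[pos + 1] << 8
                           | (uint32_t)data[pos + 2] << 16
                           | (uint32_t)data[pos + 3] << 24;
            pos += 4;
            break;
        case STRING_BY_LENGTH:  /* consume the previously read length */
            if (pos + pending_length > size)
                return 0;
            pos += pending_length;
            break;
        }
    }
    return pos;
}

/* The example definition above, expressed as a field table:
 * Field1: Byte = 123; SomeTextLength: DWORD;
 * SomeText: string[SomeTextLength]; Field4: Char = 'r' */
static const struct field example[] = {
    { FIXED_BYTE, 123 },
    { LENGTH_DWORD, 0 },
    { STRING_BY_LENGTH, 0 },
    { FIXED_CHAR, 'r' },
};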

Carving algorithm

Joachim

  • should we allow for multiple runs?
  • should we allow for multiple algorithms?
  • does the algorithm pass data blocks to the validators?
  • does a validator need to maintain a state?
  • does a validator need to revert a state?
  • do we use the assumption that a data block can be used by a single file (with the exception of embedded/encapsulated files)?

Carving scenarios

Joachim

  • normal file (file structure, loose text based structure (more a content structure?))
  • fragmented file (the file exists in its entirety)
  • a file fragment (the file does not exist in its entirety)
  • intertwined file
  • encapsulated file (MPEG/network capture)
  • embedded file (JPEG thumbnail)

File System Awareness

Background: Why be File System Aware?

Advantages of being FS aware:

  • You can pick up sector allocation sizes
Joachim do you mean file system block sizes?
  • Some file systems may store things off sector boundaries. (ReiserFS with tail packing)
  • Increasingly file systems have compression (NTFS compression)
  • Carve just the sectors that are not in allocated files.

Tasks that would be required

Discussion

As noted above, TSK should be utilized as much as possible, particularly the filesystem-aware portion. If we want to identify filesystems outside of its supported set, it would be more worth our time to work on implementing them there than in the carver itself. RB
I guess this tool operates like Selective file dumper and can recover files in both ways (or not?). Recovering files by using carving can recover files in situations where sleuthkit does nothing (e.g. file on NTFS was deleted using ntfs-3g, or filesystem was destroyed or just unknown). And we should build the list of filesystems supported by carver, not by TSK. .FUF 07:08, 29 October 2008 (UTC)
This tool is still in the early planning stages (requirements discovery), hence few operational details (like precise modes of operation) have been fleshed out - those will and should come later. The justification for strictly using TSK for the filesystem-sensitive approach is simple: TSK has good filesystem APIs, and it would be foolish to create yet another standalone, incompatible implementation of filesystem(foo) when time would be better spent improving those in TSK, aiding other methods of analysis as well. This is the same reason individuals that have implemented several other carvers are participating: de-duplication of effort. RB

Joachim I would like to have the carver (recovery tool) also do recovery using file allocation data or remainders of file allocation data.

Joachim I would go as far as to ask you all to look beyond the carver as a tool and look from the perspective of the carver as part of the forensic investigation process. In my eyes certain information needed/acquired by the carver could also be very useful investigative information, i.e. what part of a hard disk contains empty sectors.

Supportive tooling

Joachim

  • validator (definitions) tester (detest in revit07)
  • tool to make configuration based definitions
  • post carving validation
  • the carver needs to provide support for fuse mount of carved files (carvfs)

Testing

Joachim

  • automated testing
  • test data

Validator Construction

Options:

  • Write validators in C/C++
    • Joachim you mean dedicated validators
  • Have a scripting language for writing them (Python? Perl? our own?)
    • Joachim use easy to embed programming languages, i.e. Python or Lua
  • Use existing programs (libjpeg?) as plug-in validators?
    • Joachim define a file structure api for this
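To make the dedicated-validator option above concrete, here is a sketch of what a C plug-in interface could look like; all names are hypothetical.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical interface for a dedicated validator plug-in. */
enum validator_result {
    VALIDATOR_NO_MATCH,   /* the data cannot belong to this format */
    VALIDATOR_NEED_MORE,  /* plausible so far; feed the next block */
    VALIDATOR_MATCH       /* a complete, valid file was recognized */
};

struct validator {
    const char *name;                 /* e.g. "jpeg" */
    void *(*create_state)(void);      /* per-candidate-file state */
    void (*destroy_state)(void *state);
    /* Called with successive data blocks of a candidate file. */
    enum validator_result (*feed)(void *state, const uint8_t *block,
                                  size_t size);
};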

Existing Code that we have

Joachim

Carvers

  • DFRWS2006/2007 carving challenge results
  • DFRWS2008 paper on carving
  • photorec
  • revit06 and revit07
  • s3/scarve

Possible file structure validator libraries

  • diverse existing file support libraries
  • libole2 (in-house experimental code for OLE2 support)
  • libpff (alpha release for PFF (PST/OST) file support)

Input support

  • AFF
  • EWF
  • TSK device & raw & split raw

Volume/Partition support

  • disktype
  • testdisk
  • TSK

File system support

  • TSK
  • photorec FS code
  • implementations of FS in Linux/BSD

Content support

Implementation Timeline

  1. gather the available resources/ideas/wishes/needs etc. (I guess we're in this phase)
  2. start discussing a high level design (in terms of algorithm, facilities, information needed)
    1. input formats facility
    2. partition/volume facility
    3. file system facility
    4. file format facility
    5. content facility
    6. how to deal with fragment detection (do the validators allow for fragment detection?)
    7. how to deal with recombination of fragments
    8. do we want multiple carving phases in light of speed/precision tradeoffs
  3. start detailing parts of the design
    1. Discuss options for a grammar driven validator?
    2. Hard-coded plug-ins?
    3. Which existing code can we use?
  4. start building/assembling parts of the tooling for a prototype
    1. Implement simple file carving with validation.
    2. Implement gap carving
  5. Initial Release
  6. Implement the threaded carving that .FUF is describing above.