Difference between pages "FAT" and "Carver 2.0 Planning Page"

=Technical Overview=
+
This page is for planning Carver 2.0.
  
FAT, or File Allocation Table, is a file system designed to keep track of the allocation status of clusters on a hard drive.  Developed in 1977 by Microsoft, FAT was originally intended to be a file system for the Microsoft Disk BASIC interpreter.  FAT was quickly incorporated into an early version of Tim Paterson's QDOS, short for "Quick and Dirty Operating System". Microsoft later purchased the rights to QDOS and released it as PC-DOS (for IBM) and, under its own branding, as MS-DOS.
+
Please, do not delete text (ideas) here. Use something like this:
  
FAT is a file system.
+
<pre>
 +
<s>bad idea</s>
 +
:: good idea
 +
</pre>
  
== Structure ==
+
This will look like:
[[Image:Yale fat16 diagram.jpg|thumb|Basic layout of the FAT16 file system.]]
+
The FAT file system is composed of several areas:
+
  
*  Boot Record or Boot Sector
+
<s>bad idea</s>
*  FATs
+
:: good idea
*  Root Directory or Root Folder
+
*  Data Area
+
*  Clusters
+
*  Wasted Sectors
+
  
=== Boot Record ===
+
= License =
When a computer is powered on, a POST (power-on self test) is performed, and control is then transferred to the MBR (Master Boot Record).  The MBR is present no matter what file system is in use, and contains information about how the storage device is logically partitioned.  When using a FAT file system, the MBR hands off control of the computer to the Boot Record, which is the first sector on the partition.  The Boot Record, which occupies a reserved area on the partition, contains executable code, in addition to information such as an OEM identifier, number of FATs, media descriptor (type of storage device), and information about the operating system to be booted.  Once the Boot Record code executes, control is handed off to the operating system installed on that partition.
+
  
=== FATs ===
+
BSD-3.
The primary task of the File Allocation Tables is to keep track of the allocation status of clusters, or logical groupings of sectors, on the disk drive. There are four different possible FAT entry types: allocated (along with the address of the next cluster associated with the file), unallocated, end of file, and bad sector.
+
:: [[User:Joachim Metz|Joachim]] library based validators could require other licenses
 +
::: Make the other libraries plug-able. If you have them, you use them. [[User:Simsong|Simsong]] 06:34, 3 November 2008 (UTC)
  
In order to provide redundancy in case of data corruption, two FATs, FAT1 and FAT2, are stored in the file system. FAT2 is typically a duplicate of FAT1. However, FAT mirroring can be disabled on a FAT32 drive, thus enabling any of the FATs to become the primary FAT. This possibly leaves FAT1 empty, which can be deceiving.
+
= OS =
  
=== Root Directory ===
+
Linux/FreeBSD/MacOS
The Root Directory, sometimes referred to as the Root Folder, contains an entry for each file and directory stored in the file system. This information includes the file name, starting cluster number, and file size. This information is changed whenever a file is created or subsequently modified. The root directory has a fixed size of 512 entries on a hard disk; on a floppy disk its size depends on the media type. With FAT32 it can be stored anywhere within the partition, although in previous versions it is always located immediately following the FAT region.
+
: Shouldn't this just match what the underlying afflib & sleuthkit cover? [[User:RB|RB]]
 +
:: Yes, but you need to test and validate on each. Question: Do we want to support windows? [[User:Simsong|Simsong]] 21:09, 30 October 2008 (UTC)
 +
:: [[User:Joachim Metz|Joachim]] I think we would be wise to design with Windows support from the start; this will improve the platform independence from the start
 +
:::: Agreed; I would even settle at first for being able to run against Cygwin. Note that I don't even own or use a copy of Windows, but the vast majority of forensic investigators do. [[User:RB|RB]] 14:01, 31 October 2008 (UTC)
 +
:: [[User:Capibara|Rob J Meijer]] Leaning heavily on the autotools might be the way to go. I do however feel that Windows support would not be an essential requirement. Being able to run from a virtual machine with the main storage mounted over cifs should however be tested and if possible tuned extensively.
 +
:::: [[User:Joachim Metz|Joachim]] You'll need more than autotools to do native Windows support i.e. file access, UTF-16 support, wrap some basic system functions or have them available otherwise
 +
::::::[[User:Capibara|Rob J Meijer]] That's exactly my point: Windows support, as in being able to build and run on Windows natively, is much more trouble than it's worth. Better to make a lean and mean autotools based build with few dependencies and no or little recursion, and better to spend effort on a lean POLA design on POSIX based systems than on supporting building and running on non-POSIX systems.
  
=== Data Area ===
+
= Name tooling =
  
The Boot Record, FATs, and Root Directory are collectively referred to as the System Area. The remaining space on the logical drive is called the Data Area, which is where files are actually stored. It should be noted that when a file is deleted by the operating system, the data stored in the Data Area remains intact until it is overwritten.
+
* [[User:Joachim Metz|Joachim]] As a name for the tooling I propose coldcut
 +
:: How about 'butcher'? ;) [[User:RB|RB]] 14:20, 31 October 2008 (UTC)
 +
:: [[User:Joachim Metz|Joachim]] cleaver ( scalpel on steroids ;-) )
 +
* I would like to propose Gouge or Chisel :-) [[User:Capibara|Rob J Meijer]]
  
=== Clusters ===
+
= Requirements =
In order for FAT to manage files with satisfactory efficiency, it groups sectors into larger blocks referred to as clusters. This is necessary because the file system can only address a fixed number of clusters.  On a modern drive, there are far more sectors than available addresses.  Consequently, sectors are grouped together (into clusters) in order to share an address.  A cluster is the smallest unit of disk space that can be allocated to a file, which is why clusters are often called allocation units. Only the "data area" is divided into clusters; the rest of the partition is simply sectors. Cluster size is determined by the size of the disk volume, and every file must be allocated a whole number of clusters. Cluster sizing has a significant impact on performance and disk utilization. Larger cluster sizes result in more wasted space because files are less likely to exactly fill a whole number of clusters.
+
  
The size of one cluster is specified in the Boot Record and can range from a single sector (512 bytes) to 128 sectors (65536 bytes). The sectors in a cluster are contiguous, so each cluster is a continuous block of space on the disk. Note that only one file can be allocated to a cluster; therefore, if a 1KB file is placed within a 32KB cluster, 31KB of that cluster is wasted. The formula for determining the number of clusters in a partition is ((# of Sectors in Partition) - (# of Sectors per FAT * 2) - (# of Reserved Sectors)) / (# of Sectors per Cluster).
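As a quick illustration of the formula above, a minimal sketch (Python; the parameter names are illustrative, and the values would come from the Boot Record):

<pre>
def cluster_count(total_sectors, sectors_per_fat, reserved_sectors, sectors_per_cluster):
    # ((# of Sectors in Partition) - (# of Sectors per FAT * 2)
    #  - (# of Reserved Sectors)) / (# of Sectors per Cluster)
    data_sectors = total_sectors - (sectors_per_fat * 2) - reserved_sectors
    return data_sectors // sectors_per_cluster

# Example: a 1 GB partition (2,097,152 sectors), two FATs of 256 sectors each,
# 32 reserved sectors, 64 sectors (32 KB) per cluster:
print(cluster_count(2097152, 256, 32, 64))  # 32759
</pre>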
+
[[User:Joachim Metz|Joachim]] Could we do a MoSCoW evaluation of these.
 +
* AFF and EWF file images supported from scratch. ([[User:Joachim Metz|Joachim]] I would like to have raw/split raw and device access as well)
 +
:: If we base our image i/o on afflib, we get all three with one interface. [[User:RB|RB]] Instead of letting the tools use afflib, better to write an afflib module for carvfs, and update the libewf module. The tool could then be oblivious of the file format. [[User:Capibara|Rob J Meijer]]
 +
:::: [[User:Simsong|Simsong]] 06:29, 3 November 2008 (UTC) The problem with using carvfs is that this adds another dependency. Do you really want to require that people install carvfs in order to run the carver? What about having the thing ported to Windows?
 +
:::::: [[User:Capibara|Rob J Meijer]] I would support adding one build dependency (libcarvpath) and removing two (libewf/libaff) by moving them to a layer more suited for them (carvfs) that would possibly allow some form of file handle (as cap) based POLA design. I am a proponent of making small things that do one thing and do it right, and of stacking those to do what you need. In my view that would ideally lead to the following (simplified) chain:
 +
::::::* recursive-forensic-framework (ocfa/pyflag)
 +
::::::** <b>The-(pola-based)-carving-tool</b>
 +
::::::*** <b>The-carving-lib</b> working on open fd's. 
 +
::::::**** libcarvpath
 +
::::::***** carvfs (Over cifs/nfs-v4 on platforms that don't support Fuse).
 +
::::::****** libewf
 +
::::::****** libaff
 +
::::::*** AppArmor (on supporting platforms)
 +
::::::*** suid (on supporting platforms)
 +
::::::*** iptables/ipfw (on supporting platforms)
 +
:::::: As for Windows support, I would imagine making carvfs run over smb would come a long way, that is, for as far as Windows support is relevant at all.
 +
:::::: There are two advantages to using libcarvpath and carvfs instead of libaff/libewf at this layer:
 +
::::::* Storage requirements for doing carving are reduced. Beyond what sleuthkit or alternatives provide, I have seen many situations where carving was not done due to storage limitations.
 +
::::::* File handles are like object capabilities. You can often do pretty simple POLA based implementations using file handles and something like AppArmor. POLA could IMHO be a strong weapon against the more nasty forms of anti forensics.
 +
::::::Next to this, I would consider making different tools for different stages instead of one semi recursive one, and looking at how to integrate these tools into existing frameworks (ocfa/pyflag).
 +
::::::Keep things simple but rigid and try to easily integrate things into existing frameworks as effectively as possible I would suggest.
 +
::::::Please note, I am not proposing the lib/tool should be useless without libcarvpath, only that usage without carvfs should limit the
 +
::::::supported image formats to raw images, and that libewf/libaff should be abstracted at the Fuse level or below and not at the tool level.   
 +
:::::::[[User:Joachim Metz|Joachim]] do you have an idea what the performance impact of this approach would be? It might be wise to do a proof of concept for this approach first.
 +
:::: [[User:Joachim Metz|Joachim]] this layer should support multi threaded decompression of compressed image types, this speeds up IO
 +
* [[User:Joachim Metz|Joachim]] volume/partition aware layer (what about carving unpartitioned space?)
 +
* File system aware layer. This could be or make use of tsk-cp.
 +
** By default, files are not carved. (clarify: only identified? [[User:RB|RB]]; I guess that it operates like [[Selective file dumper]] [[User:.FUF|.FUF]] 07:00, 29 October 2008 (UTC)). Alternatively, the tool could use libcarvpath and output carvpaths or create a directory with symlinks to carvpaths that point into a carvfs mountpoint [[User:Capibara|Rob J Meijer]].
 +
* Plug-in architecture for identification/validation.
 +
** [[User:Joachim Metz|Joachim]] support for multiple types of validators
 +
*** dedicated validator
 +
*** validator based on file library (i.e. we could specify/implement a file structure API for these)
 +
*** configuration based validator (can handle config files, like Revit07, to enter different file formats used by the carver)
  
=== Wasted Sectors ===
+
[[User:Joachim Metz|Joachim]]
 +
Moderator: Could we limit the requirements for prototype version 1 of the tool to get a working version up and running ASAP?
 +
And keep discussing future options?
  
Wasted Sectors are a result of the number of data sectors not being evenly divisible by the cluster size. They consist of unused bytes left at the end of a file. Also, if the partition as declared in the partition table is larger than what is claimed in the Boot Record, the volume can be said to have wasted sectors. Small files on a hard drive are the main cause of wasted space, and the bigger the hard drive the more wasted space there is, because for a fixed number of cluster addresses a larger hard drive must have more sectors in each cluster.
+
I think the following set will be large enough to handle:
 +
Input facilities
 +
* IO support (AFF, device, EWF, RAW and split RAW)
 +
:: Abstraction of input format and multi threaded decompression (spin-off code out of afflib?)
 +
* Volume/Partitions support
 +
:: at least for DOS based layout and GPT (spin-off code out of TSK/Photorec?)
 +
* File system support
 +
:: VFAT/NTFS (spin-off code out of TSK/Photorec?)
  
Carving facilities
* File format support using plug-able validator model (use dedicated validators Photorec/Scarve and/or wrap revit07 file format as validator?)
* Content support using plug-able validator model (to handle text/mbox base64)
* File system carving support (to handle file system fragments, could be linked to file system support layer?)
* Basic fragment handling

=== FAT Entry Values ===

FAT12:
<pre>
0x000          (Free Cluster)
0x001          (Reserved Cluster)
0x002 - 0xFEF  (Used cluster; value points to next cluster)
0xFF0 - 0xFF6  (Reserved values)
0xFF7          (Bad cluster)
0xFF8 - 0xFFF  (Last cluster in file)
</pre>

FAT16:
<pre>
0x0000           (Free Cluster)
0x0001           (Reserved Cluster)
0x0002 - 0xFFEF  (Used cluster; value points to next cluster)
0xFFF0 - 0xFFF6  (Reserved values)
0xFFF7           (Bad cluster)
0xFFF8 - 0xFFFF  (Last cluster in file)
</pre>

FAT32:
<pre>
0x?0000000               (Free Cluster)
0x?0000001               (Reserved Cluster)
0x?0000002 - 0x?FFFFFEF  (Used cluster; value points to next cluster)
0x?FFFFFF0 - 0x?FFFFFF6  (Reserved values)
0x?FFFFFF7               (Bad cluster)
0x?FFFFFF8 - 0x?FFFFFFF  (Last cluster in file)
</pre>

Note: FAT32 uses only 28 of the 32 possible bits; the upper 4 bits should be left alone. Typically these bits are zero, and they are represented above by a question mark (?).
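Because those upper 4 bits must be masked off before an entry is interpreted, here is a minimal sketch (Python; the function name and result strings are illustrative) that classifies FAT32 entries according to the ranges listed above:

<pre>
FAT32_MASK = 0x0FFFFFFF  # FAT32 uses only the low 28 bits

def classify_fat32_entry(raw_entry):
    value = raw_entry & FAT32_MASK   # leave the upper 4 bits alone
    if value == 0x0000000:
        return "free cluster"
    if value == 0x0000001:
        return "reserved cluster"
    if 0x0000002 <= value <= 0xFFFFFEF:
        return "used cluster; value points to next cluster"
    if 0xFFFFFF0 <= value <= 0xFFFFFF6:
        return "reserved value"
    if value == 0xFFFFFF7:
        return "bad cluster"
    return "last cluster in file"    # 0xFFFFFF8 - 0xFFFFFFF
</pre>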
+
Output facilities
 +
* audit/analysis/debug log
 +
* extraction of result files
  
[[Category:Disk file systems]]
+
==Supported File Formats==
 +
* Ship with validators for:
 +
[[User:Joachim Metz|Joachim]] I think we should distinguish between file format validators and content validators
 +
** JPEG
 +
** PNG
 +
** GIF
 +
** MSOLE
 +
** ZIP
 +
** TAR (gz/bz2)
  
==Versions==
+
[[User:Joachim Metz|Joachim]] For a production carver we need at least the following formats
 +
** Graphical Images
 +
*** JPEG (the 3 different types with JFIF/EXIF support)
 +
*** PNG
 +
*** GIF
 +
*** BMP
 +
*** TIFF
 +
** Office documents
 +
*** OLE2 (Word/Excel content support)
 +
*** PDF
 +
*** Open Office/Office 2007 (ZIP+XML)
 +
:: Extension validation? AFAIK, MS Office 2007 [[DOCX]] format uses plain ZIP (or not?), and carved files will (or not?) have .zip extension instead of DOCX. Is there any way to fix this (may be using the file list in zip)? [[User:.FUF|.FUF]] 20:25, 31 October 2008 (UTC)
 +
:: [[User:Joachim Metz|Joachim]] Addition: Office 2007 also has a binary file format, which is also ZIP-ed data
  
There are three variants of FAT in existence: FAT12, FAT16, and FAT32.
 
  
'''FAT12'''
+
==Archive Files==
<br />
+
** Archive files
* FAT12 is the oldest type of FAT that uses a 12 bit file allocation table entry. 
+
*** ZIP
* FAT12 can hold a max of 4,086 clusters (which is 2<sup>12</sup> clusters minus a few values that are reserved for values used in  the FAT)
+
*** 7z
* It is used for floppy disks and hard drive partitions that are smaller than 16 MB. 
+
*** gzip
* All 1.44 MB 3.5" floppy disks are formatted using FAT12.
+
*** bzip2
* The cluster size used is between 0.5 KB and 4 KB.
+
*** tar
 +
*** RAR
 +
** E-mail files
 +
*** PFF (PST/OST)
 +
*** MBOX (text based format, base64 content support)
 +
** Audio/Video files
 +
*** MPEG
 +
*** MP2/MP3
 +
*** AVI
 +
*** ASF/WMV
 +
*** QuickTime
 +
*** MKV
 +
** Printer spool files
 +
*** EMF (if I remember correctly)
 +
** Internet history files
 +
*** index.dat
 +
*** Firefox (SQLite 3)
 +
** Other files
 +
*** thumbs.db
 +
*** pagefile?
  
'''FAT16'''
+
==Carving Strategies==
<br/>
+
[[User:Joachim Metz|Joachim]] Note to moderator: could this section be merged with the carving algorithm section?
*  It is called FAT16 because all entries are 16 bits.
+
*  FAT16 can hold a max of 65,536 addressable units (2<sup>16</sup>)
+
*  It is used for small and moderate sized hard disk volumes.
+
*  The actual capacity is 65,525 due to some reserved values
+
  
'''FAT32'''
+
* Simple fragment recovery carving using gap carving.
<br />
+
** [[User:Joachim Metz|Joachim]] have hook in for more advanced fragment recovery?
FAT32 is the enhanced version of the FAT system implemented beginning with Windows 95 OSR2, Windows 98, and Windows Me.
+
* Recovering of individual ZIP sections and JPEG icons that are not sector aligned.
Features include:
+
** [[User:Joachim Metz|Joachim]] I would propose a generic fragment detection and recovery
* Drives of up to 2 terabytes are supported (Windows 2000 only supports up to 32 gigabytes)
+
* Autonomous operation (some mode of operation should be completely non-interactive, requiring no human intervention to complete [[User:RB|RB]])
* Since FAT32 uses smaller clusters (of 4 kilobytes each), it uses hard drive space more efficiently. This is a 10 to 15 percent improvement over FAT or FAT16.
+
** [[User:Joachim Metz|Joachim]] as much as possible, but allow to be overwritten by user
* The limitations of FAT12 or FAT16 on the number of root folder entries have been eliminated. In FAT32, the root folder is an ordinary cluster chain, and can be located anywhere on the drive.
+
* [[User:Joachim Metz|Joachim]] When the tool outputs files, the filenames should contain the offset in the input data (in hexadecimal?)
* File allocation mirroring can be disabled in FAT32. This allows a copy of the file allocation table other than the default to be active.
+
:: [[User:Mark Stam|Mark]] I really like the fact that carved files are named after the physical or logical sector in which the file is found (photorec)
<br />
+
:::: [[User:Joachim Metz|Joachim]] This naming schema might cause duplicate name problems when extracting embedded files and when extracting files from non sector aligned file systems.
''FAT32 Limitations with Windows 2000 & Windows XP''
+
* [[User:Joachim Metz|Joachim]] Should the tool allow exporting embedded files?
* Clusters cannot be 64KB or larger.
+
* [[User:Joachim Metz|Joachim]] Should the tool allow exporting fragments separately?
* Cannot decrease the cluster size such that the FAT becomes larger than 16 MB minus 64 KB in size.
+
* [[User:Mark Stam|Mark]] I personally use photorec often for carving files in the whole volume (not only unallocated clusters), so I can store information about all potentially interesting files in MySQL
* Cannot contain fewer than 65,527 clusters.
+
:: [[User:Joachim Metz|Joachim]] interesting, Bas Kloet and I have been discussing using information about allocated files in the recovery process, i.e. recovered fragments could be part of allocated files. Do we want to be able to extract them? Or could we rebuild the file from the fragments and the allocated files?
* Maximum of 32KB per cluster.
+
* [[User:Mark Stam|Mark]] It would also be nice if the files can be hashed immediately (MD5) so looking for them in other tools (for example Encase) is a snap
* ''Windows XP'': The Windows XP installation program will not allow a user to format a drive of more than 32GB using the FAT32 file system. Using the installation program, the only way to format a disk greater than 32GB in size is to use NTFS. A disk larger than 32GB in size ''can'' be formatted with FAT32 for use with Windows XP if the system is booted from a Windows 98 or Windows ME startup disk, and formatted using the tool that will be on the disk.
+
<br />
+
'''Comparison of FAT Versions'''
+
  
See the table at http://en.wikipedia.org/wiki/File_Allocation_Table for more detailed information about the various versions of FAT.
+
==Performance Requirements==
 +
* Tested on 500GB-sized images. Should be able to carve a 500GB image in roughly 50% longer than it takes to read the image.
 +
** Perhaps allocate a percentage budget per-validator (i.e. each validator adds N% to the carving time) [[User:RB|RB]]
 +
** [[User:Joachim Metz|Joachim]] have multiple carving phases for precision/speed trade off?
 +
* Parallelizable
 +
** [[User:Joachim Metz|Joachim]] tunable for different architectures
 +
* Configuration:
 +
** Capability to parse some existing carvers' configuration files, either on-the-fly or as a one-way converter.
 +
** Disengage internal configuration structure from configuration files, create parsers that present the expected structure
 +
** [[User:Joachim Metz|Joachim]] The validator should deal with the file structure; the carving algorithm should not know anything about the file structure (as in the revit07 design)
 +
**  Either extend Scalpel/Foremost syntaxes for extended features or use a tertiary syntax ([[User:Joachim Metz|Joachim]] I would prefer a derivative of the revit07 configuration syntax which already has encountered some problems of dealing with defining file structure in a configuration file)
  
==Applications of FAT==
+
==Output==
 +
* Can output audit.txt file.
 +
* [[User:Joachim Metz|Joachim]] Can output database with offset analysis values i.e. for visualization tooling
 +
* [[User:Joachim Metz|Joachim]] Can output debug log for debugging the algorithm/validation
 +
* Easy integration into ascription software.
 +
:: [[User:Joachim Metz|Joachim]] I'm not a native speaker; what do you mean by "ascription software"?
 +
::: I think this was another non-native requesting easy scriptability. [[User:RB|RB]] 14:20, 31 October 2008 (UTC)
 +
:::: [[User:Joachim Metz|Joachim]] that makes sense ;-)
 +
::::: Incorrect. Ascription software is software that determines who the owner of a file is. [[User:Simsong|Simsong]] 06:36, 3 November 2008 (UTC)
  
Due to its low cost, mobility, and non-volatile nature, flash memory has quickly become the choice medium for storing and transferring data in consumer electronic devices. The majority of flash memory storage is formatted using the FAT file system. In addition, FAT is also frequently used in electronic devices with miniature hard drives.
+
= Ideas =
 +
* Use as much of TSK as possible. Don't carry your own FS implementation the way photorec does.
 +
:: [[User:Joachim Metz|Joachim]] using TSK as much as possible would not allow adding your own file system support (i.e. mobile phones, memory structures, cap files). I would propose wrapping TSK and using it as much as possible, but allowing integration of our own FS implementations.
 +
* Extracting/carving data from [[Thumbs.db]]? I've used [[foremost]] for it with some success. [[Vinetto]] has some critical bugs :( [[User:.FUF|.FUF]] 19:18, 28 October 2008 (UTC)
  
Examples of devices in which FAT is utilized include:
+
==Recursive Carving==
 +
[[User:Joachim Metz|Joachim]] do we want to support (let's call it) 'recursive in-file carving' (for now)? This is different from embedded files, because there is a file system structure in the file and not just another file structure
 +
* Is it just me, or do a lot of the above (and below) ideas somewhat skirt around the fact that many of us want recursive carving?  Can we bend back to that instead of discussing object particulars?  I think this can be distilled down to three requirements:
 +
** Simple recursion: once an object is identified, have the ability to re-carve it for internal structures
 +
** Directed recursion: the carver should be able to be directed at arbitrary blobs and told to carve them as a specified type.  This allows programmatically simpler methods of dealing with unidentifiably compressed or encrypted data.  Or filesystem fragments.
 +
** Export: the ability to export an object (recognized or not) for later or external "recursion".  Should go without saying for a carver, but...
 +
:--[[User:RB|RB]] 18:45, 2 November 2008 (UTC)
 +
:: [[User:Simsong|Simsong]] 06:30, 3 November 2008 (UTC) pyflag already does recursive carving. Are we just going to reimplement pyflag as a single executable?
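To make the "simple recursion" requirement above concrete, a minimal sketch (Python; the validator objects and their find_all method are hypothetical, not an existing API):

<pre>
def carve(blob, validators, depth=0, max_depth=3):
    # Simple recursion: once an object is identified, re-carve it
    # for internal structures, up to a fixed depth.
    results = []
    for validator in validators:
        for offset, obj in validator.find_all(blob):  # hypothetical API
            results.append((depth, validator.name, offset))
            if depth < max_depth:
                results.extend(carve(obj, validators, depth + 1, max_depth))
    return results
</pre>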
  
* USB thumb drives
+
==Library Dependencies==
* Digital cameras
+
[[User:Capibara|Rob J Meijer]] :
* Digital camcorders
+
* Use libcarvpath whenever possible and by default to avoid high storage requirements.
* Portable audio and video players
+
:: [[User:Joachim Metz|Joachim]] For easy deployment I would not opt for making an integral part of the tool solely dependent on a single external library; otherwise the library must be integrated in the package
* Multifunction printers
+
::[[User:Capibara|Rob J Meijer]] Integrating libraries (libtsk, libaff, libewf, libcarvpath, etc.) is bad practice, autotools are your friend IMO.
* Electronic photo frames
+
:: [[User:Joachim Metz|Joachim]] I'm not talking about integrating (shared) libraries. I'm talking about the fact that an integral part of a tool should be part of its package. Why can't the tool package contain shared or static libraries for local use? A far worse thing to do is to have a large set of dependencies, making the tool difficult to install for most users. The tool package should contain the most necessary code. afflib/libewf support could be detected by the autotools, a neat separation of functionality.
* Electronic musical instruments
+
::: From a packager's standpoint, [[User:Joachim Metz|Joachim]]'s other libraries do a really good job of this, carrying around what they need but using a system-global version if available.  [[User:RB|RB]]
* Standard televisions
+
* PDAs
+
  
=Forensics Issues=
+
==Filesystem Detection==
==Data Recovery==
+
* Don't stop with filesystem detection after the first match. Often if a partition is reused with a new FS and is not all that full yet, much of the old FS can still be valid. I have seen this with ext2/fat. The fact that you have identified a valid FS on a partition doesn't mean there isn't an (almost) valid second FS that would yield additional files. Identifying doubly allocated space might in some cases also be relevant.
Recovering directory entries from FAT filesystems as part of [[recovering deleted data]] can be accomplished by looking for entries that begin with the sigma character (0xE5). When a file or directory is deleted under a FAT filesystem, the first character of its name is changed to sigma (0xE5). The remainder of the directory entry information remains intact.
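A minimal sketch (Python) of scanning a raw directory region for such deleted entries; the 32-byte entry size and the 0xE5 marker are standard FAT, the rest is illustrative:

<pre>
def find_deleted_entries(directory_bytes):
    # FAT directory entries are 32 bytes; a first byte of 0xE5 marks a
    # deleted entry, and the rest of the entry remains intact.
    deleted = []
    for offset in range(0, len(directory_bytes) - 31, 32):
        entry = directory_bytes[offset:offset + 32]
        if entry[0] == 0xE5:
            name = entry[1:11].decode("ascii", "replace")  # remainder of the 8.3 name
            deleted.append((offset, name))
    return deleted
</pre>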
+
:: [[User:Joachim Metz|Joachim]] What you're saying is that dealing with file system fragments should be part of the carving algorithm
 +
* Allow use where filesystem based carving is done by another tool, and this tool is used as a second stage on (sets of) unallocated block (pseudo) files and/or non-FS partition (pseudo) files.
 +
:: [[User:Joachim Metz|Joachim]] I would not opt for this. The tool would be dependent on other tools and their data format, which makes the tool difficult to maintain. I would opt to integrate the functionality of having multiple recovery phases (stages) and allow the tooling to run the phases after one and other or separately.
 +
::[[User:Capibara|Rob J Meijer]] More generically, I feel a way should exist to communicate the 'left overs' a previous (non open, for example LE-only) tool left.
 +
:: [[User:Joachim Metz|Joachim]] I guess if the tool is designed to handle multiple phases it should store its data somewhere. So it should be possible to convert results of such non open tooling to the format required. However I would opt to design the recovery functionality of these non-open tools into open tools. And not to limit ourselves making translators due to the design of these non-open tools.
 +
* Ability to be used as a library instead of a tool. Ability to access metadata through the library, and thus the ability to set metadata from the carving modules. This would be extremely useful for integrating the project into a framework like ocfa.
 +
:: [[User:Joachim Metz|Joachim]] I guess most of the code could be integrated into libraries, but I would not opt integrating tool functionality into a library
 +
* A wild idea that I hope at least one person will have a liking for: It might be very interesting to look at the possibilities of using a multi process style of module support and combining it with a least authority design. On platforms that support AppArmor (or similar) and uid based firewall rules, this could make for the first true POLA (principle of least authority) based forensic tool ever. POLA based forensics tools should make for a strong integrity guard against many anti-forensics techniques. Alternatively we could look at integrating a capability secure language (E?) for implementation of at least the validation modules. I don't expect this idea to make it, but I hope mentioning it might spark off less strong alternatives that at least partially address the integrity + anti-forensics problem. If we can in some way introduce POLA to a wider forensics public, other tools might also pick up on it, which would be great.
 +
:: [[User:Joachim Metz|Joachim]] Could you give an example of how you see this in action?
 +
::::[[User:Capibara|Rob J Meijer]] I see two layers where POLA could be applied. The best one would require one of the following as prerequisites:
 +
::::* The libaff/libewf layer is moved to a fuse implementation (for example carvfs).
 +
::::* Libewf/Libaff are updated to accept opened file handles instead of demanding to open their own files.
 +
::::If one of these is fulfilled, then the tool running as some user can just have the simple task of opening the image files, starting up the 'real' tool and handing over the appropriate file handles. If the real tool runs with a restrictive AppArmor profile, and is started suid to a tool specific user that also has its own iptables uid based filter, then the real tool will run with least authority.
 +
:::: A second alternative, if neither of the first prerequisites can be met, would be to run the modules as confined processes and have a non confined process run as proxy for the first.
 +
:::: A third, probably far fetched, alternative would be to embed an object capability language in the tool and make the module interface such that modules are to be written in this ocap language.
 +
::::A 4th alternative might include minorfs or plash, but I haven't given those sufficient thinking hours yet.
  
The pointers are also changed to zero for each cluster used by the file.  Recovery tools look at the FAT to find the entry for the file.  The location of the starting cluster will still be in the directory entry; it is not deleted or modified.  The tool will go straight to that cluster and try to recover the file, using the file size to determine the number of clusters to recover.  Some tools will go to the starting cluster and recover the next "X" clusters needed for the specific file size.  However, this approach is not ideal.  An ideal tool will locate "X" ''available'' clusters.  Since files are most often fragmented, this is a more precise way to recover the file.
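A sketch of the "ideal tool" behaviour described above (Python; assumes the FAT has already been read into a list of entry values, everything else is illustrative):

<pre>
def recover_clusters(fat, start_cluster, file_size, bytes_per_cluster):
    # Locate "X" available (unallocated) clusters starting from the
    # starting cluster in the directory entry, rather than blindly
    # taking the next X contiguous clusters.
    needed = -(-file_size // bytes_per_cluster)  # ceiling division
    recovered, cluster = [], start_cluster
    while len(recovered) < needed and cluster < len(fat):
        if fat[cluster] == 0x000:                # deleted file entries are zeroed
            recovered.append(cluster)
        cluster += 1
    return recovered
</pre>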
 
  
An issue arises when two files in the same row of clusters are deleted.  If the clusters are not in sequential order, the tool will automatically recover the next "X" clusters.  However, because the file was fragmented, it is likely that not all of the clusters obtained will contain data for that file. If these two deleted files are in the same row of clusters, it is highly unlikely the file can be recovered.
+
* [[User:Mark Stam|Mark]] I think it would be very handy to have a CSV, TSV, XML or other delimited output (log)file with information about carved files. This output file can then be stored in a database or Excel sheet (report function)
  
==File Slack==
+
== Format syntax specification ==
File slack is data that starts from the end of the file as written and continues to the end of the sectors designated to the file.  There are two types of file slack: RAM slack and residual slack. RAM slack starts from the end of the file and goes to the end of that sector.  Residual slack then starts at the next sector and goes to the end of the cluster allocated for the file.  File slack is helpful when analyzing a hard drive because the old data that is not overwritten by the new file is still intact. See http://www.pcguide.com/ref/hdd/file/partSizes-c.html for examples.
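The amount of slack per file follows directly from the file size and the cluster size (a small Python illustration):

<pre>
def slack_bytes(file_size, cluster_size):
    # Bytes between the end of the file and the end of its last cluster.
    remainder = file_size % cluster_size
    return 0 if remainder == 0 else cluster_size - remainder

print(slack_bytes(1024, 32768))  # 31744: a 1 KB file in a 32 KB cluster wastes 31 KB
</pre>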
+
* Carving data structures. For example, extract all TCP headers from image by defining TCP header structure and some fields (e.g. source port > 1024, dest port = 80). This will extract all data matching the pattern and write a file with other fields. Another example is carving INFO2 structures and URL activity records from index.dat [[User:.FUF|.FUF]] 20:51, 28 October 2008 (UTC)
 +
** This has the opportunity to be extended to the concept of "point at blob FOO and interpret it as BAR"
 +
.FUF added:
 +
The main idea is to allow users to define structures, for example (in pascal-like form):
  
<br/>
+
<pre>
 +
Field1: Byte = 123;
 +
SomeTextLength: DWORD;
 +
SomeText: string[SomeTextLength];
 +
Field4: Char = 'r';
 +
...
 +
</pre>
  
This will produce something like this:

<pre>
Field1 = 123
SomeTextLength = 5
SomeText = 'abcd1'
Field4 = 'r'
</pre>

(In text or raw forms.)

{| border="1" cellpadding="4"
|-
! Cluster !! Sample Slack Space, 50% Cluster Slack Per File !! Sample Slack Space, 67% Cluster Slack Per File
|-
| 2 kiB || 17 MB || 22 MB
|-
| 4 kiB || 33 MB || 44 MB
|-
| 8 kiB || 66 MB || 89 MB
|-
| 16 kiB || 133 MB || 177 MB
|-
| 32 kiB || 265 MB || 354 MB
|}
  
The table above demonstrates that the larger the cluster size used, the more disk space is wasted due to slack. This suggests it is better to use smaller cluster sizes whenever possible.
+
Opinions?
  
<br/>
+
Opinion: Simple pattern identification like that may not suffice, I think Simson's original intent was not only to identify but to allow for validation routines (plugins, as the original wording was).  As such, the format syntax would need to implement a large chunk of some programming language in order to be sufficiently flexible. [[User:RB|RB]]
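For discussion purposes, a minimal sketch of how such a declarative description could be evaluated against a byte stream (Python struct module; the field table mirrors .FUF's example above, everything else is hypothetical):

<pre>
import struct

# (field name, struct format or name of an earlier length field, expected value or None)
FIELDS = [
    ("Field1", "B", 123),
    ("SomeTextLength", "<I", None),
    ("SomeText", "SomeTextLength", None),  # length taken from an earlier field
    ("Field4", "c", b"r"),
]

def match_structure(data, offset=0):
    values, pos = {}, offset
    for name, fmt, expected in FIELDS:
        if fmt in values:                        # variable-length field
            size = values[fmt]
            values[name] = data[pos:pos + size]
        else:
            size = struct.calcsize(fmt)
            values[name] = struct.unpack_from(fmt, data, pos)[0]
        if expected is not None and values[name] != expected:
            return None                          # pattern does not match at this offset
        pos += size
    return values
</pre>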
  
=FAT vs. NTFS=
+
[[User:Joachim Metz|Joachim]]
 +
In my opinion your example is too limited. While making the revit configuration I learned that you need nearly a programming language to specify some file formats.
 +
A simple descriptive language is too limiting. I would also go for "2 bytes with endianness" instead of using terminology like WORD and small integer; it's much clearer. The configuration also needs to deal with aspects like cardinality, and required and optional structures.
 +
:: This is simply data structures carving, see ideas above. Somebody (I cannot track so many changes per day) separated the original text. There is no need to count and join different structures. [[User:.FUF|.FUF]] 19:53, 31 October 2008 (UTC)
 +
:::: [[User:Joachim Metz|Joachim]] This was probably me; is the text back in its original form?
 +
:::: I started it by moving your Revit07 comment to the validator/plugin section in [http://www.forensicswiki.org/index.php?title=Carver_2.0_Planning_Page&diff=prev&oldid=7583 this edit], since I was still at that point thinking operational configuration for that section, not parser configurations. [[User:RB|RB]]
 +
:::: [[User:Joachim Metz|Joachim]] I renamed the title to format syntax, clarity is important ;-)
  
==FAT Advantages==
+
Please take a look at the revit07 format syntax specification (configuration). It's not there yet but it goes a long way. Some things currently missing:
*  Files available to multiple operating systems on the same computer
+
* bitwise alignment
*  Easier to switch from FAT to NTFS than vice versa
+
* handling encapsulated streams (MPEG/capture files)
*  Performs faster on smaller volumes (< 10GB)
+
* handling content based formats (MBOX)
*  Does not index files, which gives slightly higher performance
+
* Performs better with small cache sizes (< 96MB)
+
* More space efficient on small volumes (< 4GB)
+
* Performs better with slow disks (< 5400RPM)
+
  
==FAT Disadvantages==
+
= Carving algorithm =
* FAT has a fixed maximum number of clusters per partition, which means as the hard disk gets bigger the size of each cluster must increase, creating more slack space
+
[[User:Joachim Metz|Joachim]]
* Doesn't natively support many abilities of NTFS such as compression, encryption, or advanced security using access control lists
+
* should we allow for multiple carving phases (runs/stages)?
* NTFS is recommended by Microsoft for volumes larger than 32GB
+
:: I opt yes (separation of concern)
* FAT slows down as the number of files on the disk increases
+
* should we allow for multiple carving algorithms?
* FAT usually fragments files more
+
:: I opt yes, this allows testing of different approaches
* FAT does not allow for indexing of files for faster searching
+
* Should the algorithm try to do as much as possible in one run over the input data, to reduce IO?
* FAT does not support user quotas
+
:: I opt that the tool should allow for both multiple and single runs over the input data, to minimize either the IO or the CPU as bottleneck
 +
* Interaction between algorithm and validators
 +
** does the algorithm pass data blocks to the validators?
 +
** does a validator need to maintain a state?
 +
** does a validator need to revert a state?
 +
** How do we deal with embedded files and content validation? Do the validators call another validator?
 +
* do we use the assumption that a data block can be used by a single file (with the exception of embedded/encapsulated files)?
 +
* Revit07 allows for multiple concurrent result file states to deal with fragmentation. One has the attribute of being active (the preferred one) and the others passive. Do we want/need something similar? The algorithm adds blocks of input data (offsets) to these result file states.
 +
** if so, what info would these result file states require (type, list of input data blocks)?
 +
* how do we deal with file system remainders?
 +
** Can we abstract them and compare them against available file system information?
 +
* Do we carve file systems in files?
 +
:: I opt that at least the validator uses this information
  
 +
== Carving scenarios ==
 +
[[User:Joachim Metz|Joachim]]
 +
* normal file (file structure, loose text based structure (more a content structure?))
 +
* fragmented file (the file entirely exist)
 +
* a file fragment (the file does not entirely exist)
 +
* intertwined file
 +
* encapsulated file (MPEG/network capture)
 +
* embedded file (JPEG thumbnail)
 +
* obfuscation ('encrypted' PFF); this also entails encryption and/or compression
 +
* file system in file
  
== External links ==
+
=File System Awareness =
* http://en.wikipedia.org/wiki/File_Allocation_Table
+
==Background: Why be File System Aware?==
* http://www.microsoft.com
+
Advantages of being FS aware:
* http://www.ntfs.com
+
* You can pick up sector allocation sizes
* http://www.ntfs.com/ntfs_vs_fat.htm
+
:: [[User:Joachim Metz|Joachim]] do you mean file system block sizes?
* http://support.microsoft.com/kb/q154997/#XSLTH3126121123120121120120
+
* Some file systems may store things off sector boundaries. (ReiserFS with tail packing)
* http://www.dewassoc.com/kbase/hard_drives/boot_sector.htm
+
* Increasingly file systems have compression (NTFS compression)
* http://home.teleport.com/~brainy/fat32.htm
+
* Carve just the sectors that are not in allocated files.
* http://www2.tech.purdue.edu/cpt/courses/cpt499s/
+
 
* http://home.no.net/tkos/info/fat.html
+
==Tasks that would be required==
* http://web.ukonline.co.uk/cook/fat32.htm
+
 
* http://www.ntfs.com/fat-systems.htm
+
==Discussion==
* http://www.microsoft.com/whdc/system/platform/firmware/fatgen.mspx
+
:: As noted above, TSK should be utilized as much as possible, particularly the filesystem-aware portion.  If we want to identify filesystems outside of its supported set, it would be more worth our time to work on implementing them there than in the carver itself.  [[User:RB|RB]]
* http://support.microsoft.com/kb/q140418
+
 
 +
:::: I guess this tool operates like [[Selective file dumper]] and can recover files in both ways (or not?). Recovering files by using carving can recover files in situations where sleuthkit does nothing (e.g. file on NTFS was deleted using ntfs-3g, or filesystem was destroyed or just unknown). And we should build the list of filesystems supported by carver, not by TSK. [[User:.FUF|.FUF]] 07:08, 29 October 2008 (UTC)
 +
 
 +
:: This tool is still in the early planning stages (requirements discovery), hence few operational details (like precise modes of operation) have been fleshed out - those will and should come later.  The justification for strictly using TSK for the filesystem-sensitive approach is simple: TSK has good filesystem APIs, and it would be foolish to create yet another standalone, incompatible implementation of filesystem(foo) when time would be better spent improving those in TSK, aiding other methods of analysis as well.  This is the same reason individuals that have implemented several other carvers are participating: de-duplication of effort.  [[User:RB|RB]]
 +
 
 +
:: [[User:Joachim Metz|Joachim]] A design problem might be that TSK currently is a single library operating on multiple layers (storage media IO, volume/partition analysis and file system analysis). I'm not aware how easily the parts can be used separately. But I estimate that for the carver we want to use these 3 layers differently than TSK currently does.
 +
 
 +
[[User:Joachim Metz|Joachim]] I would like to have the carver (recovery tool) also do recovery using file allocation data or remainders of file allocation data.
 +
 
 +
[[User:Joachim Metz|Joachim]]
 +
I would go as far as to ask you all to look beyond the carver as a tool and look from the perspective of the carver as part of the forensic investigation process. In my eyes certain information needed/acquired by the carver could also be very useful investigative information, i.e. which part of a hard disk contains empty sectors.
 +
 
 +
=Supportive tooling=
 +
[[User:Joachim Metz|Joachim]]
 +
* validator (definitions) tester (detest in revit07)
 +
* tool to make configuration based definitions
 +
* post carving validation
 +
* the carver needs to provide support for fuse mount of carved files (carvfs)
 +
 
 +
=Testing =
 +
[[User:Joachim Metz|Joachim]]
 +
* automated testing
 +
* test data
 +
 
 +
=Validator Construction=
 +
Options:
 +
* Write validators in C/C++
 +
** [[User:Joachim Metz|Joachim]] you mean dedicated validators
 +
* Have a scripting language for writing them (Python? Perl?), or our own?
 +
** [[User:Joachim Metz|Joachim]] use easy to embed programming languages, i.e. Python or Lua
 +
* Use existing programs (libjpeg?) as plug-in validators?
 +
** [[User:Joachim Metz|Joachim]] define a file structure api for this
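A minimal sketch (Python; all names are hypothetical) of what a plug-able validator interface could look like, whether validators end up dedicated, library based or configuration based:

<pre>
class JpegValidator:
    """Hypothetical plug-in: one validator per file format."""
    name = "jpeg"

    def identify(self, data):
        # Cheap check: does this block start like a JPEG?
        return data.startswith(b"\xff\xd8\xff")

    def validate(self, data):
        # More expensive check: return the length of the valid object,
        # or None if no complete file is present.
        if not self.identify(data):
            return None
        end = data.find(b"\xff\xd9")  # EOI marker
        return end + 2 if end != -1 else None
</pre>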
 +
 
 +
=Existing Code that we have=
 +
[[User:Joachim Metz|Joachim]]
 +
Please add any missing links
 +
 
 +
Documentation/Articles
 +
* DFRWS2006/2007 carving challenge results
 +
* DFRWS2008 paper on carving
 +
 
 +
Carvers
 +
* DFRWS2006/2007 carving challenge results
 +
* photorec (http://www.cgsecurity.org/wiki/PhotoRec)
 +
* revit06 and revit07 (http://sourceforge.net/projects/revit/)
 +
* s3/scarve
 +
 
 +
Possible file structure validator libraries
 +
* diverse existing file support libraries
 +
* libole2 (inhouse experimental code of OLE2 support)
 +
* libpff (alpha release for PFF (PST/OST) file support) (http://sourceforge.net/projects/libpff/)
 +
 
 +
Input support
 +
* AFF (http://www.afflib.org/)
 +
* EWF (http://sourceforge.net/projects/libewf/)
 +
* TSK device & raw & split raw (http://www.sleuthkit.org/)
 +
 
 +
Volume/Partition support
 +
* disktype (http://disktype.sourceforge.net/)
 +
* testdisk (http://www.cgsecurity.org/wiki/TestDisk)
 +
* TSK
 +
 
 +
File system support
 +
* TSK
 +
* photorec FS code
 +
* implementations of FS in Linux/BSD
 +
 
 +
Content support
 +
 
 +
Zero storage support
 +
* libcarvpath ( http://sourceforge.net/project/showfiles.php?group_id=170249&package_id=210704 )
 +
* carvfs ( http://sourceforge.net/project/showfiles.php?group_id=170249&package_id=210954 )
 +
* tsk-cp ( http://sourceforge.net/project/showfiles.php?group_id=170249&package_id=267227 )
 +
* carvfsmodewf (http://sourceforge.net/project/showfiles.php?group_id=170249&package_id=268256 )
 +
POLA
 +
* joe-e (java) ( http://code.google.com/p/joe-e/ )
 +
* Emily (ocaml)  ( http://erights.org/download/emily/ )
 +
* the E language ( http://www.erights.org/ )
 +
* AppArmor
 +
* iptables/ipfw
 +
* minorfs ( http://polacanthus.net/minorfs.html )
 +
* plash ( http://plash.beasts.org/wiki/ )
 +
 
 +
=Implementation Timeline=
 +
# gather the available resources/ideas/wishes/needs etc. (I guess we're in this phase)
 +
# start discussing a high level design (in terms of algorithm, facilities, information needed)
 +
## input formats facility
 +
## partition/volume facility
 +
## file system facility
 +
## file format facility
 +
## content facility
 +
## how to deal with fragment detection (do the validators allow for fragment detection?)
 +
## how to deal with recombination of fragments
 +
## do we want multiple carving phases in light of speed/precision tradeoffs
 +
# start detailing parts of the design
 +
## Discuss options for a grammar driven validator?
 +
## Hard-coded plug-ins?
 +
## Which existing code can we use?
 +
# start building/assembling parts of the tooling for a prototype
 +
## Implement simple file carving with validation.
 +
## Implement gap carving
 +
# Initial Release
 +
# Implement the ''threaded carving'' that [[User:.FUF|.FUF]] is describing above.
 +
 
 +
[[User:Joachim Metz|Joachim]] Shouldn't multi threaded carving (MTC) be part of the 1st version?
 +
The MT approach makes for different design decisions
 +
: It is virtually impossible to turn a non-MT application into an MT application. [[User:Simsong|Simsong]] 06:37, 3 November 2008 (UTC)

Revision as of 11:11, 3 November 2008

This page is for planning Carver 2.0.

Please, do not delete text (ideas) here. Use something like this:

<s>bad idea</s>
:: good idea

This will look like:

bad idea

good idea

License

BSD-3.

Joachim library based validators could require other licenses
Make the other libraries plug-able. If you them, you use them. Simsong 06:34, 3 November 2008 (UTC)

OS

Linux/FreeBSD/MacOS

Shouldn't this just match what the underlying afflib & sleuthkit cover? RB
Yes, but you need to test and validate on each. Question: Do we want to support windows? Simsong 21:09, 30 October 2008 (UTC)
Joachim I think we would do wise to design with windows support from the start this will improve the platform independence from the start
Agreed; I would even settle at first for being able to run against Cygwin. Note that I don't even own or use a copy of Windows, but the vast majority of forensic investigators do. RB 14:01, 31 October 2008 (UTC)
Rob J Meijer Leaning heavily on the autotools might be the way to go. I do however feel that support requirements for windows would not be essential. Being able to run from a virtual machine with the main storage mounted over cifs should however be tested and if possible tuned extensively.
Joachim You'll need more than autotools to do native Windows support i.e. file access, UTF-16 support, wrap some basic system functions or have them available otherwise
Rob J Meijer That´s exactly my point, windows support as in being able to build and run on windows natively is much more trouble than its worth. Better make for a lean and mean autotools based build with little dependencies and no or little recursion, and better spent effort on a lean POLA design on POSIX based systems than on supporting building and running on non POSIX systems.

Name tooling

  • Joachim A name for the tooling I propose coldcut
How about 'butcher'?  ;) RB 14:20, 31 October 2008 (UTC)
Joachim cleaver ( scalpel on steroids ;-) )

Requirements

Joachim Could we do a MoSCoW evaluation of these.

  • AFF and EWF file images supported from scratch. (Joachim I would like to have raw/split raw and device access as well)
If we base our image i/o on afflib, we get all three with one interface. RB Instead of letting the tools use afflib, better to write an afflib module for carvfs, and update the libewf module. The tool could than be oblivious of the file format. Rob J Meijer
Simsong 06:29, 3 November 2008 (UTC) The problem with using carvfs is that this adds another dependency. Do you really want to require that people install carvfs in order to run the carver? What about having the thing ported to Windows?
Rob J Meijer I would support adding one build dependency (libcarvpath) and removing two (libewf/libaff) by moving them to a layer more suited for them (carvfs) that would possibly allow some form of file handle (as cap) based POLA design. I am a proponent of making small things that do one and do one thing right, and to stack those to do what you need. In my view that would lead ideally to the following (simplified) chain:
  • recursive-forensic-framework (ocfa/pyflag)
    • The-(pola-based)-carving-tool
      • The-carving-lib working on open fd's.
        • libcarvpath
          • carvfs (Over cifs/nfs-v4 on platforms that don't support Fuse).
            • libewf
            • libaff
      • AppArmor (on supporting platforms)
      • suid (on supporting platforms)
      • iptables/ipfw (on supporting platforms)
As fow windows support, I would imagine making carvfs run over smb would come a long way, that is for as far as windows support is all that relevant.
There are two advantages to using libcarvpath and carvfs instead of libaff/libewf t this layer:
  • storage requirements for doing carving. Beyond what sleuthkit or alternatives provide I have seen many situations where carving was not done due to storage limitations.
  • File handles are like object capabilities. You can often do pretty simple POLA based implementations using file handles and something like AppArmor. POLA could IMHO be a strong weapon against the more nasty forms of anti forensics.
Next to this, I would consider making different tools for different stages instead of one semi recursive one, and looking at how to integrate these tools into existing frameworks (ocfa/pyflag).
Keep things simple but rigid and try to easily integrate things into existing frameworks as effectively as possible I would suggest.
Please note, I am not ptoposing the lib/tool should be useless without libcarvpath, only that usage without carvfs should limit the
supported image formats to raw images, and that libewf/libaff should be abstracted at the Fuse level or below and not at the tool level.
Joachim do you have an idea what the performance impact of this approach would be? It might be wise to do a proof of concept for this approach first.
Joachim this layer should support multi threaded decompression of compressed image types, this speeds up IO
  • Joachim volume/partition aware layer (what about carving unpartioned space)
  • File system aware layer. This could be or make use of tsk-cp.
    • By default, files are not carved. (clarify: only identified? RB; I guess that it operates like Selective file dumper .FUF 07:00, 29 October 2008 (UTC)). Alternatively, the tool could use libcarvpath and output carvpaths or create a directory with symlinks to carvpaths that point into a carvfs mountpoint Rob J Meijer.
  • Plug-in architecture for identification/validation.
    • Joachim support for multiple types of validators
      • dedicated validator
      • validator based on file library (i.e. we could specify/implement a file structure API for these)
      • configuration based validator (Can handle config files,like Revit07, to enter different file formats used by the carver.)

Joachim Moderator: Could we limit the requirements for prototype version 1 of the tool to get a working version up and running ASAP? And keep discussing future options?

I think the following set will be large enough to handle: Input facilities

  • IO support (AFF, device, EWF, RAW and split RAW)
Abstraction of input format and multi threaded decompression (spin-off code out of afflib?)
  • Volume/Partitions support
at least for DOS based layout and GPT (spin-off code out of TSK/Photorec?)
  • File system support
VFAT/NTFS (spin-off code out of TSK/Photorec?)

Carving facilities

  • File format support using plug-able validator model (use dedicated validators Photorec/Scarve and/or wrap revit07 file format as validator?)
  • Content support using plug-able validator model (to handle text/mbox base64)
  • File system carving support (to handle file system fragments, could be linked to file system support layer?)
  • Basic fragment handling

Output facilities

  • audit/analysis/debug log
  • extraction of result files

Supported File Formats

  • Ship with validators for:

Joachim I think we should distinguish between file format validators and content validators

    • JPEG
    • PNG
    • GIF
    • MSOLE
    • ZIP
    • TAR (gz/bz2)

Joachim For a production carver we need at least the following formats

    • Grapical Images
      • JPEG (the 3 different types with JFIF/EXIF support)
      • PNG
      • GIF
      • BMP
      • TIFF
    • Office documents
      • OLE2 (Word/Excell content support)
      • PDF
      • Open Office/Office 2007 (ZIP+XML)
Extension validation? AFAIK, MS Office 2007 DOCX format uses plain ZIP (or not?), and carved files will (or not?) have .zip extension instead of DOCX. Is there any way to fix this (may be using the file list in zip)? .FUF 20:25, 31 October 2008 (UTC)
Joachim Addition: Office 2007 also has a binary file format which is also a ZIP-ed data


Archive Files

    • Archive files
      • ZIP
      • 7z
      • gzip
      • bzip2
      • tar
      • RAR
    • E-mail files
      • PFF (PST/OST)
      • MBOX (text based format, base64 content support)
    • Audio/Video files
      • MPEG
      • MP2/MP3
      • AVI
      • ASF/WMV
      • QuickTime
      • MKV
    • Printer spool files
      • EMF (if I remember correctly)
    • Internet history files
      • index.dat
      • firefox (sqllite 3)
    • Other files
      • thumbs.db
      • pagefile?

Carving Strategies

Joachim Note to moderator could this section be merged with the carving algorithm section?

  • Simple fragment recovery carving using gap carving.
    • Joachim have hook in for more advanced fragment recovery?
  • Recovering of individual ZIP sections and JPEG icons that are not sector aligned.
    • Joachim I would propose a generic fragment detection and recovery
  • Autonomous operation (some mode of operation should be completely non-interactive, requiring no human intervention to complete RB)
    • Joachim as much as possible, but allow to be overwritten by user
  • Joachim When the tool output files the filenames should contain the offset in the input data (in hexadecimal?)
Mark I really like the fact that carved files are named after the physical or logical sector in which the file is found (PhotoRec)
Joachim This naming scheme might cause duplicate-name problems when extracting embedded files or extracting files from non-sector-aligned file systems.
  • Joachim Should the tool allow exporting embedded files?
  • Joachim Should the tool allow exporting fragments separately?
  • Mark I personally often use PhotoRec for carving files in the whole volume (not only unallocated clusters), so I can store information about all potentially interesting files in MySQL
Joachim Interesting. Bas Kloet and I have been discussing using information about allocated files in the recovery process, i.e. recovered fragments could be part of allocated files. Do we want to be able to extract them? Or could we rebuild the file from the fragments and the allocated files?
  • Mark It would also be nice if the files could be hashed immediately (MD5) so that looking them up in other tools (for example EnCase) is a snap
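
A minimal sketch of the two points above, assuming a hex-offset naming convention with an optional sub-index to avoid the duplicate-name problem Joachim mentions; none of this is a decided format:

<pre>
import hashlib

def carved_name(offset: int, ext: str, sub: int = 0) -> str:
    # e.g. 0x00012a00.jpg, or 0x00012a00-1.jpg for an embedded file
    base = f"0x{offset:08x}"
    return f"{base}-{sub}.{ext}" if sub else f"{base}.{ext}"

def write_carved(data: bytes, offset: int, ext: str, sub: int = 0):
    name = carved_name(offset, ext, sub)
    with open(name, "wb") as f:
        f.write(data)
    md5 = hashlib.md5(data).hexdigest()   # for matching in EnCase etc.
    print(f"{name},{len(data)},{md5}")    # one audit-log line per file
    return name, md5
</pre>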

Performance Requirements

  • Tested on 500 GB images. Should be able to carve a 500 GB image in roughly 50% more time than it takes to read the image.
    • Perhaps allocate a percentage budget per validator (i.e. each validator may add N% to the carving time); see the sketch after this list RB
    • Joachim have multiple carving phases for a precision/speed trade-off?
  • Parallelizable
    • Joachim tunable for different architectures
  • Configuration:
    • Capability to parse some existing carvers' configuration files, either on-the-fly or as a one-way converter.
    • Disengage internal configuration structure from configuration files, create parsers that present the expected structure
    • Joachim The validator should deal with the file structure; the carving algorithm should not know anything about the file structure (as in the revit07 design)
    • Either extend the Scalpel/Foremost syntaxes for the extended features or use a third syntax (Joachim I would prefer a derivative of the revit07 configuration syntax, which has already encountered some of the problems of defining file structure in a configuration file)
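
A rough sketch of RB's per-validator budget idea: track wall-clock time per validator and skip any validator that exceeds its share of total runtime. The 10% default and the skipping policy are illustrative assumptions only.

<pre>
import time
from collections import defaultdict

class BudgetedValidators:
    def __init__(self, validators, max_fraction=0.10):
        self.validators = dict(validators)   # name -> callable(buf) -> bool
        self.spent = defaultdict(float)      # seconds consumed per validator
        self.start = time.monotonic()
        self.max_fraction = max_fraction

    def run(self, name, buf):
        elapsed_total = time.monotonic() - self.start
        if elapsed_total > 0 and self.spent[name] / elapsed_total > self.max_fraction:
            return False                     # over budget: skip this validator
        t0 = time.monotonic()
        try:
            return self.validators[name](buf)
        finally:
            self.spent[name] += time.monotonic() - t0
</pre>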

Output

  • Can output audit.txt file.
  • Joachim Can output a database with offset analysis values, e.g. for visualization tooling
  • Joachim Can output a debug log for debugging the algorithm/validation
  • Easy integration into ascription software.
Joachim I'm not a native speaker; what do you mean by "ascription software"?
I think this was another non-native requesting easy scriptability. RB 14:20, 31 October 2008 (UTC)
Joachim that makes sense ;-)
Incorrect. Ascription software is software that determines who the owner of a file is. Simsong 06:36, 3 November 2008 (UTC)

Ideas

  • Use TSK as much as possible. Don't carry your own FS implementation the way PhotoRec does.
Joachim Using TSK as much as possible would not allow adding our own file system support (e.g. mobile phones, memory structures, cap files). I would propose wrapping TSK and using it as much as possible, but allowing our own FS implementations to be integrated.
  • Extracting/carving data from Thumbs.db? I've used foremost for it with some success. Vinetto has some critical bugs :( .FUF 19:18, 28 October 2008 (UTC)

Recursive Carving

Joachim Do we want to support (let's call it for now) 'recursive in-file carving'? This is different from embedded files, because there is a file system structure in the file and not just another file structure.

  • Is it just me, or do a lot of the above (and below) ideas somewhat skirt around the fact that many of us want recursive carving? Can we bend back to that instead of discussing object particulars? I think this can be distilled down to three requirements (see the sketch after this list):
    • Simple recursion: once an object is identified, have the ability to re-carve it for internal structures
    • Directed recursion: the carver should be able to be pointed at an arbitrary blob and told to carve it as a specified type. This allows programmatically simpler handling of compressed or encrypted data that cannot be identified automatically. Or file system fragments.
    • Export: the ability to export an object (recognized or not) for later or external "recursion". Should go without saying for a carver, but...
--RB 18:45, 2 November 2008 (UTC)
Simsong 06:30, 3 November 2008 (UTC) pyflag already does recursive carving. Are we just going to reimplement pyflag as a single executable?
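
A minimal sketch of the three requirements above, with placeholder identification and per-type carving logic (identify() and CARVERS are purely illustrative, not a decided design):

<pre>
from typing import Optional

def identify(blob: bytes) -> Optional[str]:
    """Toy identification: real code would consult the validator plug-ins."""
    if blob[:2] == b"\xff\xd8":
        return "jpeg"
    if blob[:4] == b"PK\x03\x04":
        return "zip"
    return None

CARVERS = {}  # type name -> function(blob) -> list of contained child blobs

def carve(blob: bytes, forced_type: Optional[str] = None, depth: int = 0):
    kind = forced_type or identify(blob)   # directed vs. simple recursion
    results = [(depth, kind, blob)]        # export: every object is retained
    if kind in CARVERS and depth < 8:      # guard against runaway recursion
        for child in CARVERS[kind](blob):
            results += carve(child, depth=depth + 1)
    return results
</pre>

Directed recursion is just the forced_type argument; simple recursion and export fall out of the same loop.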

Library Dependencies

Rob J Meijer:

  • Use libcarvpath whenever possible and by default to avoid high storage requirements.
Joachim For easy deployment I would not make an integral part of the tool solely dependent on a single external library; either that, or the library must be integrated into the package.
Rob J Meijer Integrating libraries (libtsk, libaff, libewf, libcarvpath, etc.) is bad practice; autotools are your friend IMO.
Joachim I'm not talking about integrating (shared) libraries. I'm saying that an integral part of a tool should be part of its package. Why can't the tool package contain shared or static libraries for local use? A far worse thing is to have a large set of dependencies, making the tool difficult to install for most users. The tool package should contain the most necessary code; afflib/libewf support could be detected by autotools, giving a neat separation of functionality.
From a packager's standpoint, Joachim's other libraries do a really good job of this, carrying around what they need but using a system-global version if available. RB
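
Returning to the libcarvpath suggestion at the top of this section, a sketch of the zero-storage idea: a carved file is only a designation of fragments inside the parent image, never a copy. The "offset+size" tokens joined by "_" follow the carvpath convention as I understand it; treat the exact syntax as an assumption.

<pre>
def carvpath(fragments) -> str:
    """fragments: list of (offset, size) tuples within the parent image."""
    return "_".join(f"{off}+{size}" for off, size in fragments)

# A fragmented file: 4 KiB at offset 0, then 4 KiB at offset 1 MiB.
print(carvpath([(0, 4096), (1048576, 4096)]))
# -> "0+4096_1048576+4096"; a symlink with this name into a carvfs
# mountpoint would expose the file without copying any data.
</pre>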

Filesystem Detection

  • Don't stop filesystem detection after the first match. Often, if a partition is reused with a new FS and is not all that full yet, much of the old FS can still be valid. I have seen this with ext2/FAT. The fact that you have identified a valid FS on a partition doesn't mean there isn't an (almost) valid second FS that would yield additional files. Identifying doubly allocated space might in some cases also be relevant. (A signature-scan sketch follows at the end of this section.)
Joachim What you're saying is that dealing with file system fragments should be part of the carving algorithm
  • Allow use where filesystem-based carving is done by another tool, and this tool is used as a second stage on (sets of) unallocated-block (pseudo) files and/or non-FS partition (pseudo) files.
Joachim I would not opt for this. The tool would be dependent on other tools and their data formats, which makes it difficult to maintain. I would instead integrate the functionality of having multiple recovery phases (stages) and allow the tooling to run the phases one after another or separately.
Rob J Meijer More generically, I feel a way should exist to communicate the 'leftovers' that a previous (non-open, for example LE-only) tool left behind.
Joachim I guess if the tool is designed to handle multiple phases, it should store its data somewhere, so it should be possible to convert the results of such non-open tooling to the required format. However, I would opt to design the recovery functionality of these non-open tools into open tools, and not to limit ourselves to writing translators dictated by the design of these non-open tools.
  • Ability to be used as a library instead of a tool. Ability to access metadata through the library, and thus the ability to set metadata from the carving modules. This would be extremely useful for integrating the project into a framework like OCFA.
Joachim I guess most of the code could be integrated into libraries, but I would not opt for integrating tool functionality into a library
  • A wild idea that I hope at least one person will take a liking to: it might be very interesting to look at the possibilities of using a multi-process style of module support and combining it with a least-authority design. On platforms that support AppArmor (or similar) and uid-based firewall rules, this could make for the first true POLA (principle of least authority) based forensic tool ever. POLA-based forensic tools should make for a strong integrity guard against many anti-forensics techniques. Alternatively, we could look at integrating a capability-secure language (E?) for implementing at least the validation modules. I don't expect this idea to make it, but I hope mentioning it might spark off less radical alternatives that at least partially address the integrity and anti-forensics problem. If we can in some way introduce POLA to a wider forensics public, other tools might also pick up on it, which would be great.
Joachim Could you give an example of how you see this in action?
Rob J Meijer I see two layers where POLA could be applied. The best one would require one of the following as a prerequisite:
  • The libaff/libewf layer is moved to a FUSE implementation (for example carvfs).
  • Libewf/libaff are updated to accept already-opened file handles instead of demanding to open their own files.
If one of these is fulfilled, then the tool, running as some user, can have the simple task of opening the image files, starting up the 'real' tool and handing over the appropriate file handles. If the real tool runs with a restrictive AppArmor profile, and is started suid to a tool-specific user that also has its own iptables uid-based filter, then the real tool will run with least authority.
A second alternative, if neither of the first prerequisites can be met, would be to run the modules as confined processes and have a non-confined process act as proxy for the first.
A third, probably far-fetched, alternative would be to embed an object-capability language in the tool and define the module interface such that modules are written in this ocap language.
A fourth alternative might involve minorfs or plash, but I haven't given those sufficient thinking hours yet.
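
A sketch of the "don't stop after the first match" scan suggested at the top of this section: walk the whole partition and report every known signature, old FS and new alike. The magic values are real (0x55AA boot-sector signature at offset 510; ext2/3/4 superblock magic 0xEF53 at offset 1080 of a candidate FS start), but the scanner itself is illustrative.

<pre>
def detect_filesystems(read, part_size, sector=512):
    """read(offset, length) -> bytes over the partition; returns all hits."""
    hits = []
    for off in range(0, part_size - 2048, sector):
        buf = read(off, 2048)
        if len(buf) < 1082:
            break
        if buf[510:512] == b"\x55\xaa":
            hits.append((off, "boot sector (FAT/NTFS?)"))
        if buf[1080:1082] == b"\x53\xef":
            hits.append((off, "ext2/3/4 superblock"))
    return hits  # report every candidate instead of stopping at the first
</pre>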


  • Mark I think it would be very handy to have a CSV, TSV, XML or other delimited output (log) file with information about the carved files. This output file could then be loaded into a database or Excel sheet (report function).
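
A minimal sketch of such a log, assuming CSV and an illustrative column set (name, offset, size, type, MD5):

<pre>
import csv
import hashlib

def log_carved_file(logpath, name, offset, data, filetype):
    """Append one row per carved file; trivially imported into a DB/Excel."""
    with open(logpath, "a", newline="") as f:
        csv.writer(f).writerow(
            [name, hex(offset), len(data), filetype,
             hashlib.md5(data).hexdigest()]
        )
</pre>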

Format syntax specification

  • Carving data structures. For example, extract all TCP headers from an image by defining the TCP header structure and constraints on some fields (e.g. source port > 1024, dest port = 80). This will extract all data matching the pattern and write the other fields to a file. Another example is carving INFO2 structures and URL activity records from index.dat. .FUF 20:51, 28 October 2008 (UTC)
    • This has the opportunity to be extended to the concept of "point at blob FOO and interpret it as BAR"

.FUF added: The main idea is to allow users to define structures, for example (in Pascal-like form):

Field1: Byte = 123;
SomeTextLength: DWORD;
SomeText: string[SomeTextLength];
Field4: Char = 'r';
...

This will produce something like this:

Field1 = 123
SomeTextLength = 5
SomeText = 'abcd1'
Field4 = 'r'

(In text or raw forms.)

Opinions?

Opinion: Simple pattern identification like that may not suffice. I think Simson's original intent was not only to identify but to allow for validation routines (plugins, as the original wording was). As such, the format syntax would need to implement a large chunk of some programming language in order to be sufficiently flexible. RB

Joachim In my opinion your example is too limited. Making the revit configuration I learned that you need nearly a full programming language to specify some file formats; a simple descriptive language is too limiting. I would also go for "2 bytes with endianness" instead of terminology like WORD and small integer; it's much clearer. The configuration also needs to deal with aspects like cardinality, and required and optional structures.

This is simply data structure carving, see the ideas above. Somebody (I cannot track so many changes per day) separated the original text. There is no need to count and join different structures. .FUF 19:53, 31 October 2008 (UTC)
Joachim This was probably me; is the text back in its original form?
I started it by moving your Revit07 comment to the validator/plugin section in this edit, since I was still at that point thinking operational configuration for that section, not parser configurations. RB
Joachim I renamed the title to format syntax, clarity is important ;-)
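
To make the discussion concrete, a minimal sketch of how such a declarative definition could be evaluated, assuming little-endian fields; the field list mirrors .FUF's example above and the tuple-based API is purely illustrative:

<pre>
import struct

def parse(buf, fields):
    """fields: list of (name, kind, arg) tuples; returns a dict or None."""
    out, pos = {}, 0
    for name, kind, arg in fields:
        if kind == "const_byte":            # fixed one-byte value
            if buf[pos] != arg:
                return None                 # pattern does not match here
            out[name] = arg
            pos += 1
        elif kind == "dword":               # 32-bit little-endian integer
            out[name] = struct.unpack_from("<I", buf, pos)[0]
            pos += 4
        elif kind == "string_of":           # length taken from earlier field
            n = out[arg]
            out[name] = buf[pos:pos + n].decode("latin-1")
            pos += n
        elif kind == "const_char":          # fixed one-character value
            if buf[pos:pos + 1] != arg:
                return None
            out[name] = arg.decode()
            pos += 1
    return out

fields = [("Field1", "const_byte", 123),
          ("SomeTextLength", "dword", None),
          ("SomeText", "string_of", "SomeTextLength"),
          ("Field4", "const_char", b"r")]
print(parse(b"\x7b\x05\x00\x00\x00abcd1r", fields))
# -> {'Field1': 123, 'SomeTextLength': 5, 'SomeText': 'abcd1', 'Field4': 'r'}
</pre>

Joachim's point stands: once cardinality, optional structures and cross-field expressions are added, this quickly grows toward a small programming language.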

Please take a look at the revit07 format syntax specification (configuration). It's not there yet, but it goes a long way. Some things currently missing:

  • bitwise alignment
  • handling encapsulated streams (MPEG/capture files)
  • handling content based formats (MBOX)

Carving algorithm

Joachim

  • should we allow for multiple carving phases (runs/stages)?
I opt yes (separation of concerns)
  • should we allow for multiple carving algorithms?
I opt yes, this allows testing of different approaches
  • Should the algorithm try to do as much as possible in one run over the input data, to reduce IO?
I opt that the tool should allow both single and multiple runs over the input data, to minimize either IO or CPU as the bottleneck
  • Interaction between the algorithm and the validators (see the interface sketch after this list)
    • does the algorithm pass data blocks to the validators?
    • does a validator need to maintain state?
    • does a validator need to revert state?
    • How do we deal with embedded files and content validation? Do the validators call one another?
  • do we assume that a data block can be used by only a single file (with the exception of embedded/encapsulated files)?
  • Revit07 allows for multiple concurrent result file states to deal with fragmentation. One has the attribute of being active (the preferred one), the others passive. Do we want/need something similar? The algorithm adds blocks of input data (offsets) to these result file states.
    • if so, what information would these result file states require (type, list of input data blocks)?
  • how do we deal with file system remainders?
    • Can we abstract them and compare them against available file system information?
  • Do we carve file systems in files?
I opt that at least the validator uses this information
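
A sketch of one possible answer to the state questions above: the algorithm feeds blocks to a validator, which keeps internal state and can snapshot/revert it when the algorithm retries a different candidate block during fragment handling. This interface is illustrative, not a decided API.

<pre>
import copy
from abc import ABC, abstractmethod

class Validator(ABC):
    def __init__(self):
        self.state = {}       # format-specific parsing state
        self._stack = []      # snapshots, for reverting

    def snapshot(self):
        self._stack.append(copy.deepcopy(self.state))

    def revert(self):
        self.state = self._stack.pop()

    @abstractmethod
    def feed(self, block: bytes) -> str:
        """Consume the next candidate block; return 'need_more',
        'complete' or 'invalid'."""
</pre>

The carving algorithm would call snapshot() before offering a candidate next block and revert() if feed() returns 'invalid', without knowing anything about the file structure itself.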

Carving scenarios

Joachim

  • normal file (file structure, loose text-based structure (more a content structure?))
  • fragmented file (the file exists in its entirety)
  • file fragment (the file does not exist in its entirety)
  • intertwined file
  • encapsulated file (MPEG/network capture)
  • embedded file (JPEG thumbnail)
  • obfuscation ('encrypted' PFF); this also entails encryption and/or compression
  • file system in file

File System Awareness

Background: Why be File System Aware?

Advantages of being FS aware:

  • You can pick up sector allocation sizes
Joachim do you mean file system block sizes?
  • Some file systems may store things off sector boundaries. (ReiserFS with tail packing)
  • Increasingly file systems have compression (NTFS compression)
  • Carve just the sectors that are not in allocated files.
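
A sketch of the last point, assuming an allocation bitmap has already been obtained (e.g. via TSK): yield the unallocated runs that the carver should scan. Purely illustrative.

<pre>
def unallocated_runs(bitmap, block_size):
    """bitmap: sequence of bools, True = block allocated.
    Yields (offset, length) runs of unallocated space in bytes."""
    start = None
    for i, allocated in enumerate(bitmap):
        if not allocated and start is None:
            start = i                      # run of free blocks begins
        elif allocated and start is not None:
            yield (start * block_size, (i - start) * block_size)
            start = None
    if start is not None:                  # trailing free run
        yield (start * block_size, (len(bitmap) - start) * block_size)
</pre>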

Tasks that would be required

Discussion

As noted above, TSK should be utilized as much as possible, particularly the filesystem-aware portion. If we want to identify filesystems outside of its supported set, it would be more worth our time to work on implementing them there than in the carver itself. RB
I guess this tool operates like Selective file dumper and can recover files in both ways (or not?). Recovering files by carving works in situations where the Sleuth Kit finds nothing (e.g. a file on NTFS was deleted using ntfs-3g, or the filesystem was destroyed or is simply unknown). And we should build the list of filesystems supported by the carver, not by TSK. .FUF 07:08, 29 October 2008 (UTC)
This tool is still in the early planning stages (requirements discovery), hence few operational details (like precise modes of operation) have been fleshed out - those will and should come later. The justification for strictly using TSK for the filesystem-sensitive approach is simple: TSK has good filesystem APIs, and it would be foolish to create yet another standalone, incompatible implementation of filesystem(foo) when time would be better spent improving those in TSK, aiding other methods of analysis as well. This is the same reason individuals that have implemented several other carvers are participating: de-duplication of effort. RB
Joachim A design problem might be that TSK currently is a single library operating on multiple layers (storage media IO, volume/partition analysis and file system analysis). I'm not aware of how easily the parts can be used separately, but I estimate that for the carver we want to use these three layers differently than TSK currently does.

Joachim I would like to have the carver (recovery tool) also do recovery using file allocation data or remainders of file allocation data.

Joachim I would go as far as to ask you all to look beyond the carver as a tool, and look from the perspective of the carver as part of the forensic investigation process. In my eyes, certain information needed/acquired by the carver could also be very useful investigative information, e.g. which part of a hard disk contains empty sectors.

Supportive tooling

Joachim

  • validator (definitions) tester (detest in revit07)
  • tool to make configuration-based definitions
  • post-carving validation
  • the carver needs to provide support for FUSE mounting of carved files (carvfs)

Testing

Joachim

  • automated testing
  • test data

Validator Construction

Options:

  • Write validators in C/C++
    • Joachim you mean dedicated validators
  • Have a scripting language for writing them (Python? Perl? Our own?)
    • Joachim use easy-to-embed programming languages, e.g. Python or Lua
  • Use existing programs (libjpeg?) as plug-in validators? (see the stub below)
    • Joachim define a file structure API for this
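
For illustration, a dedicated validator stub of the kind that could be registered as a plug-in. The JPEG SOI/EOI markers are real, but the single-function plug-in shape and the naive EOI scan (which can stop early on embedded thumbnails) are assumptions:

<pre>
from typing import Optional

def validate_jpeg(buf: bytes) -> Optional[int]:
    """Return the length of a plausible JPEG starting at buf[0], or None."""
    if buf[:2] != b"\xff\xd8":
        return None                  # no SOI marker
    end = buf.find(b"\xff\xd9")      # EOI marker; naive, may stop early
    return end + 2 if end != -1 else None

VALIDATORS = {"jpeg": validate_jpeg}  # registry consulted by the carver core
</pre>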

Existing Code that we have

Joachim Please add any missing links

Documentation/Articles

  • DFRWS2006/2007 carving challenge results
  • DFRWS2008 paper on carving

Carvers

Possible file structure validator libraries

Input support

Volume/Partition support

File system support

  • TSK
  • photorec FS code
  • implementations of FS in Linux/BSD

Content support

Zero storage support

POLA

Implementation Timeline

  1. gather the available resources/ideas/wishes/needs etc. (I guess we're in this phase)
  2. start discussing a high level design (in terms of algorithm, facilities, information needed)
    1. input formats facility
    2. partition/volume facility
    3. file system facility
    4. file format facility
    5. content facility
    6. how to deal with fragment detection (do the validators allow for fragment detection?)
    7. how to deal with recombination of fragments
    8. do we want multiple carving phases in light of speed/precision tradeoffs
  3. start detailing parts of the design
    1. Discuss options for a grammar driven validator?
    2. Hard-coded plug-ins?
    3. Which existing code can we use?
  4. start building/assembling parts of the tooling for a prototype
    1. Implement simple file carving with validation.
    2. Implement gap carving
  5. Initial Release
  6. Implement the threaded carving that .FUF is describing above.

Joachim Shouldn't multi-threaded carving (MTC) be part of the 1st version? The MT approach makes for different design decisions

It is virtually impossible to turn a non-MT application into an MT application. Simsong 06:37, 3 November 2008 (UTC)
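
To illustrate why MT forces design decisions up front, a minimal sketch of a threaded carving loop: one reader streams blocks into a bounded queue and workers validate them in parallel, which means every validator must be thread-safe (or per-thread) from the start. All names are illustrative.

<pre>
import queue
import threading

def carve_mt(read, size, block=1 << 20, workers=4):
    q = queue.Queue(maxsize=workers * 2)   # bounded: reader can't run away

    def reader():
        for off in range(0, size, block):
            q.put((off, read(off, block)))
        for _ in range(workers):
            q.put(None)                    # one stop sentinel per worker

    def worker():
        while True:
            item = q.get()
            if item is None:
                break
            off, buf = item
            # run (thread-safe) validators over buf here

    threads = [threading.Thread(target=reader)]
    threads += [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
</pre>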