This page describes large-scale corpora of forensically interesting information that are available for those involved in forensic research.
  
= Disk Images =
;The Real Data Corpus.
: Between 1998 and 2006, [[Simson Garfinkel|Garfinkel]] acquired 1250+ hard drives on the secondary market. These hard drive images have proven invaluable in a range of studies, from the development of new forensic techniques to research on the sanitization practices of computer users.
  
: Garfinkel, S. and Shelat, A., [http://www.simson.net/clips/academic/2003.IEEE.DiskDriveForensics.pdf "Remembrance of Data Passed: A Study of Disk Sanitization Practices,"] IEEE Security and Privacy, January/February 2003.
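
A quick way to reproduce the basic measurement behind the sanitization study is to scan a raw drive image and count the sectors that still hold non-zero data. Below is a minimal Python sketch, assuming a raw (dd-style) image and the conventional 512-byte sector size:

<pre>
# Count the sectors of a raw disk image that still contain data; a
# crude indicator of how much residual information a drive retains.
# The 512-byte sector size is a conventional assumption.
import sys

SECTOR_SIZE = 512

def count_nonzero_sectors(path):
    total = nonzero = 0
    with open(path, "rb") as f:
        while True:
            sector = f.read(SECTOR_SIZE)
            if not sector:
                break
            total += 1
            if sector.count(0) != len(sector):  # any non-NUL byte left?
                nonzero += 1
    return total, nonzero

if __name__ == "__main__":
    total, nonzero = count_nonzero_sectors(sys.argv[1])
    print(f"{nonzero} of {total} sectors contain non-zero data")
</pre>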
  
;The Honeynet Project Forensic Challenge.
: In 2001 the Honeynet Project distributed a set of disk images and asked participants to conduct a forensic analysis of a compromised computer. Entries were judged and posted for all to see. The drive image and writeups are still available online.
: http://www.honeynet.org/challenge/index.html
: Further challenges were released in 2010 and 2011; two of them contained partial disk images.
: [https://www.honeynet.org/challenges/2011_7_compromised_server Challenge 7: Compromised Server]
: [https://www.honeynet.org/node/751 Challenge 9: Mobile Malware]
  
;Honeynet Project Scans of the Month
: The Honeynet Project provided network scans in the majority of its Scan of the Month challenges. Some of the challenges provided disk images instead. The Sleuth Kit's wiki lists Brian Carrier's responses to those challenges.
: http://wiki.sleuthkit.org/index.php?title=Case_Studies
  
;The [http://www.cfreds.nist.gov/ Computer Forensic Reference Data Sets] project from [[National Institute of Standards and Technology|NIST]] hosts a few sample cases that may be useful for examiners to practice with:
: http://www.cfreds.nist.gov/Hacking_Case.html
  
; Digital Forensics Tool Testing Images can be downloaded from SourceForge
: http://dftt.sourceforge.net/
  
; Shortinfosec: computer forensics competition
: http://www.shortinfosec.net/2008/07/competition-computer-forensic.html
: In the competition, you will have to analyze a submitted disk image for incriminating evidence.
: (Note: when checked in October 2011, the disk image seemed to be unavailable.)
  
; Lance Mueller has created some disk images; they can be downloaded from his blog
: http://www.forensickb.com/search?q=practical
  
; Barry Grundy created some disk images as part of a Linux-based forensics tutorial
: http://linuxleo.com
  
;The PyFlag standard test image set
: http://pyflag.sourceforge.net/Documentation/tutorials/howtos/test_image.html
  
;The Digital Forensic Research Workshop's Rodeos and Challenges
: Several of the DFRWS Rodeos and Challenges have released their data and scenario writeups. The following included disk images as part of their scenarios:
* [http://www.cfreds.nist.gov/dfrws/Rhino_Hunt.html 2005 Rodeo] (hosted on CFReDS)
* [http://dfrws.org/2008/rodeo.shtml 2008 Rodeo]
* [http://dfrws.org/2009/rodeo.shtml 2009 Rodeo]
* [http://dfrws.org/2009/challenge/index.shtml 2009 Challenge]
* [http://dfrws.org/2011/challenge/index.shtml 2011 Challenge]
  
= Memory Images =
  
The [https://www.volatilesystems.com/default/volatility Volatility] FAQ provides a listing of openly-available [https://code.google.com/p/volatility/wiki/FAQ#Are_there_any_public_memory_samples_available_that_I_can_use_for memory images].
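
Before reaching for a full framework, a memory image can be triaged with something as simple as a printable-string scan, in the spirit of the Unix strings tool. A minimal Python sketch; the six-character minimum run length is an arbitrary choice:

<pre>
# Pull printable ASCII strings out of a raw memory image, a common
# first triage step before deeper analysis with a framework such as
# Volatility. Reads the whole file, so suitable for small test images.
import re
import sys

PRINTABLE = re.compile(rb"[\x20-\x7e]{6,}")  # runs of 6+ printable bytes

with open(sys.argv[1], "rb") as f:
    data = f.read()

for match in PRINTABLE.finditer(data):
    print(match.group().decode("ascii"))
</pre>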
  
= Network Packets and Traces =
  
== DARPA ID Eval ==
  
''The DARPA Intrusion Detection Evaluation.'' In 1998, 1999 and 2000 the Information Systems Technology Group at MIT Lincoln Laboratory created a test network complete with simulated servers, clients, clerical workers, programmers, and system managers. Baseline traffic was collected. The systems on the network were then “attacked” by simulated hackers. Some of the attacks were well-known at the time, while others were developed for the purpose of the evaluation.
  
* [http://www.ll.mit.edu/IST/ideval/data/1998/1998_data_index.html 1998 DARPA Intrusion Detection Evaluation]
* [http://www.ll.mit.edu/IST/ideval/data/1999/1999_data_index.html 1999 DARPA Intrusion Detection Evaluation]
* [http://www.ll.mit.edu/IST/ideval/data/2000/2000_data_index.html 2000 DARPA Intrusion Detection Scenario Specific]
  
== WIDE ==

''The [http://www.wide.ad.jp/project/wg/mawi.html MAWI Working Group] of the [http://www.wide.ad.jp/ WIDE Project]'' maintains a [http://tracer.csl.sony.co.jp/mawi/ Traffic Archive]. In it you will find:

* a daily trace of a trans-Pacific T1 line;
* a daily trace of an IPv6 line connected to the 6Bone;
* a daily trace of another trans-Pacific line (a 100 Mbps link) in operation since 2006/07/01.

Traffic traces are made with tcpdump, and IP addresses in the traces are then scrambled with a modified version of [[tcpdpriv]].
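
Because the archive's traces are distributed in libpcap format, they can be examined with a few lines of code. A minimal Python sketch, standard library only, which assumes the classic pcap container (not pcapng):

<pre>
# Walk the packet records of a libpcap-format trace and count them.
# Classic pcap only: a 24-byte global header, then 16-byte per-packet
# record headers (ts_sec, ts_usec, incl_len, orig_len).
import struct
import sys

def iter_packets(path):
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic == b"\xd4\xc3\xb2\xa1":    # little-endian pcap
            endian = "<"
        elif magic == b"\xa1\xb2\xc3\xd4":  # big-endian pcap
            endian = ">"
        else:
            raise ValueError("not a classic libpcap file")
        f.read(20)                          # rest of the global header
        while True:
            hdr = f.read(16)                # per-packet record header
            if len(hdr) < 16:
                break
            ts_sec, ts_usec, incl_len, _ = struct.unpack(endian + "IIII", hdr)
            yield ts_sec + ts_usec / 1e6, f.read(incl_len)

if __name__ == "__main__":
    print(sum(1 for _ in iter_packets(sys.argv[1])), "packets")
</pre>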
 
== Wireshark ==

The open source Wireshark project (formerly known as Ethereal) has a website with many network packet captures:

* http://wiki.wireshark.org/SampleCaptures
 
== NFS Packets ==

The Storage Networking Industry Association has a set of network file system traces that can be downloaded from:

* http://iotta.snia.org/traces
* http://tesla.hpl.hp.com/public_software/
 
== Other ==

GitHub user "markofu" has aggregated several other network captures into a Git repository.

* https://github.com/markofu/pcaps
 
= Email messages =

''The Enron Corpus'' is a collection of email messages seized by the Federal Energy Regulatory Commission during its investigation of Enron.

* http://www.cs.cmu.edu/~enron
* http://www.enronemail.com/
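
The CMU distribution unpacks to a tree of plain-text files, one message per file, so the standard library's email package is enough to walk it. A minimal sketch; the "maildir" directory name is an assumption about where the archive was unpacked:

<pre>
# Walk an unpacked copy of the CMU Enron corpus and parse each file
# as an RFC 822 message using only the standard library.
import email
import os

def iter_messages(root):
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                yield path, email.message_from_binary_file(f)

for path, msg in iter_messages("maildir"):
    print(msg.get("From"), "->", msg.get("To"), "|", msg.get("Subject"))
</pre>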
 
The NIST '''Text REtrieval Conference''' released a public spam corpus in 2007:

* http://plg.uwaterloo.ca/~gvcormac/spam/
 
An email messages corpus parsed from W3C lists (for the 2005 TREC Enterprise Track, TRECENT):

* http://tides.umiacs.umd.edu/webtrec/trecent/parsed_w3c_corpus.html
 
= Text Files =

== Log files ==

[http://crawdad.cs.dartmouth.edu/index.php CRAWDAD] is a community archive for wireless data.

[http://www.caida.org/data/ CAIDA] collects a wide variety of data.

[http://www.dshield.org/howto.html DShield] asks users to submit firewall logs.

== Text for Text Retrieval ==

The [http://trec.nist.gov Text REtrieval Conference (TREC)] has made available a series of [http://trec.nist.gov/data.html text collections].
 
== American National Corpus ==

The [http://www.americannationalcorpus.org/ American National Corpus (ANC) project] is creating a massive collection of American English from 1990 onward. The goal is to create a corpus of at least 100 million words that is comparable to the British National Corpus.
 
== British National Corpus ==

The [http://www.natcorp.ox.ac.uk/ British National Corpus (BNC)] is a 100-million-word collection of written and spoken English from a variety of sources.
 
== IEEE VAST Challenges ==

The IEEE Visual Analytics Science & Technology (VAST) Challenges:

* [http://hcil.cs.umd.edu/localphp/hcil/vast/index.php 2009 Challenge]
* [http://hcil.cs.umd.edu/localphp/hcil/vast10/index.php 2010 Challenge]
* [http://hcil.cs.umd.edu/localphp/hcil/vast11/ 2011 Challenge]
 
= Images =

; [http://www.cs.washington.edu/research/imagedatabase UW Image Database]
: A set of freely redistributable images from all over the world, used for content-based image retrieval.
 
= Voice =

== CALLFRIEND ==

CALLFRIEND is a database of recorded English conversations. A total of 60 recorded conversations are available from the University of Pennsylvania at a cost of $600.
 
== TalkBank ==

TalkBank is an online database of spoken language. The project was originally funded between 1999 and 2004 by two National Science Foundation grants; ongoing support is provided by two NSF grants and one NIH grant.
 
== Augmented Multi-Party Interaction Corpus ==

The [http://corpus.amiproject.org/ AMI Meeting Corpus] has 100 hours of meeting recordings.
 
== Other Corpora ==

* Under an NSF grant, Kam Woods and [[Simson Garfinkel]] created [http://digitalcorpora.org digitalcorpora.org], a website for digital corpora. The site includes a complete training scenario, including disk images, packet captures, and exercises.

* The [http://corpus.canterbury.ac.nz/ Canterbury Corpus] is a set of files used for testing lossless compression algorithms (see the sketch after this list). The corpus consists of 11 natural files, 4 artificial files, 3 large files, and a file with the first million digits of pi. You can also find a copy of the Calgary Corpus at the website, which was the de facto standard for testing lossless compression algorithms in the 1990s.

* The [http://traces.cs.umass.edu/index.php/Main/HomePage UMass Trace Repository] provides network, storage, and other traces to the research community for analysis. It is supported by grant #CNS-323597 from the National Science Foundation.

* [http://arstechnica.com/science/news/2009/02/aaas-60tb-of-behavioral-data-the-everquest-2-server-logs.ars Sony has made 60 TB of EverQuest 2 logs available to researchers.] What's there? "Everything."

* UCI's [http://networkdata.ics.uci.edu/resources.php Network Data Repository] provides data sets for a diverse set of networks. Some of the networks are related to computers; some aren't.
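
As the Canterbury Corpus entry above notes, the corpus exists to benchmark lossless compressors. A minimal Python sketch of that kind of measurement, comparing the standard library's codecs on each file; the "cantrbry" directory name is an assumption about where the archive was unpacked:

<pre>
# Report each compressor's output size as a fraction of the original
# for every file in an unpacked copy of the Canterbury Corpus.
import bz2
import lzma
import zlib
from pathlib import Path

CODECS = {
    "zlib": lambda d: zlib.compress(d, 9),
    "bz2":  lambda d: bz2.compress(d, 9),
    "lzma": lambda d: lzma.compress(d),
}

for path in sorted(Path("cantrbry").iterdir()):
    data = path.read_bytes()
    if not data:
        continue
    ratios = {name: len(fn(data)) / len(data) for name, fn in CODECS.items()}
    cells = ", ".join(f"{name}={r:.3f}" for name, r in ratios.items())
    print(f"{path.name}: {cells}")
</pre>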
 
= External Links =

* [http://articles.forensicfocus.com/2013/10/18/forge-computer-forensic-test-image-generator/ ForGe – Computer Forensic Test Image Generator], Hannu Visti, October 18, 2013
