Forensic corpora

Revision as of 08:12, 4 April 2007

This page describes large-scale corpora of forensically interesting information that are available for those involved in forensic research.


Disk Images

The Harvard/MIT Drive Image Corpus. Between 1998 and 2006, Garfinkel acquired more than 1,250 hard drives on the secondary market. These hard drive images have proven invaluable in a range of studies, such as the development of new forensic techniques and research into the sanitization practices of computer users.

  • Garfinkel, S. and Shelat, A., "Remembrance of Data Passed: A Study of Disk Sanitization Practices," IEEE Security and Privacy, January/February 2003. (http://www.simson.net/clips/academic/2003.IEEE.DiskDriveForensics.pdf)

Network Packets

DARPA ID Eval

The DARPA Intrusion Detection Evaluation. In 1998, 1999, and 2000, the Information Systems Technology Group at MIT Lincoln Laboratory created a test network complete with simulated servers, clients, clerical workers, programmers, and system managers. Baseline traffic was collected, and the systems on the network were then “attacked” by simulated hackers. Some of the attacks were well known at the time, while others were developed for the purpose of the evaluation.

  • 2000 DARPA Intrusion Detection Scenario Specific data (http://www.ll.mit.edu/IST/ideval/data/2000/2000_data_index.html)

WIDE

The MAWI Working Group (http://www.wide.ad.jp/project/wg/mawi.html) of the WIDE Project (http://www.wide.ad.jp/) maintains a Traffic Archive (http://tracer.csl.sony.co.jp/mawi/). In it you will find:

  • a daily trace of a trans-Pacific T1 line
  • a daily trace of an IPv6 line connected to the 6Bone
  • a daily trace of another trans-Pacific line (100 Mbps link), in operation since 2006/07/01

Traffic traces are captured with tcpdump; IP addresses in the traces are then scrambled with a modified version of tcpdpriv.
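The idea behind address scrambling can be sketched in a few lines. This is a simplified illustration, not tcpdpriv's actual algorithm: each output octet is a keyed hash of the address prefix up to that octet, so the mapping is consistent across a trace and addresses that share a network prefix still share a (scrambled) prefix.

```python
import hashlib

def scramble_ip(ip, key=b"example-key"):
    """Deterministically scramble an IPv4 address, octet by octet.

    NOT tcpdpriv's real algorithm -- a toy version of the same idea:
    each output octet is derived from a keyed hash of the address
    prefix up to that octet, so flows remain correlatable and shared
    network prefixes map to shared scrambled prefixes.
    """
    octets = ip.split(".")
    scrambled = []
    for i in range(len(octets)):
        prefix = ".".join(octets[: i + 1]).encode()
        digest = hashlib.sha256(key + prefix).digest()
        scrambled.append(str(digest[0]))
    return ".".join(scrambled)
```

Because the mapping is deterministic for a given key, an analyst can still count flows and follow conversations in the anonymized trace without learning the real addresses.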

Wireshark

The open source Wireshark project (formerly known as Ethereal) maintains a website with many sample network packet captures:

  • http://wiki.wireshark.org/SampleCaptures
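A quick sanity check on a downloaded capture file needs only the 4-byte magic number at the start of the pcap global header, which encodes the file's byte order and timestamp resolution. A minimal sketch:

```python
import struct

# Recognized pcap magic numbers, as read in big-endian byte order:
# the classic microsecond format and the nanosecond-resolution variant,
# each in both byte orders.
PCAP_MAGICS = {
    0xA1B2C3D4: ("big-endian", "microseconds"),
    0xD4C3B2A1: ("little-endian", "microseconds"),
    0xA1B23C4D: ("big-endian", "nanoseconds"),
    0x4D3CB2A1: ("little-endian", "nanoseconds"),
}

def identify_pcap(header: bytes):
    """Return (byte order, timestamp unit) for the start of a pcap file,
    or None if the magic number is not recognized (e.g. a pcapng file)."""
    if len(header) < 4:
        return None
    (magic,) = struct.unpack(">I", header[:4])
    return PCAP_MAGICS.get(magic)
```

This only identifies the container; reading the packet records themselves additionally requires the snaplen and link-type fields from the rest of the 24-byte global header.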

Email messages

The Enron Corpus is a collection of email messages that were seized by the Federal Energy Regulatory Commission during its investigation of Enron.

  • http://www.enronemail.com/
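Messages in email corpora of this kind are stored as plain RFC 822 text, so they can be parsed directly with Python's standard-library email package. The message below is a made-up example for illustration, not an actual corpus file:

```python
from email import message_from_string

# A fabricated RFC 822 message standing in for a corpus file.
raw = """\
From: alice@example.com
To: bob@example.com
Subject: Meeting notes

Please review the attached notes before Friday.
"""

msg = message_from_string(raw)
sender = msg["From"]          # header access by name
body = msg.get_payload()      # the message body as a string
```

Iterating this over every file in the corpus directory gives structured access to headers and bodies for statistical or forensic analysis.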

Log files

CRAWDAD (http://crawdad.cs.dartmouth.edu/index.php) is a community archive for wireless data.

CAIDA (the Cooperative Association for Internet Data Analysis) collects a wide variety of Internet measurement data.

DShield (http://www.dshield.org/howto.html) asks users to submit their firewall logs.

Voice

CALLFRIEND is a database of recorded English conversations. A total of 60 recorded conversations are available from the University of Pennsylvania at a cost of $600.

TalkBank is an online database of spoken language. The project was originally funded between 1999 and 2004 by two National Science Foundation grants; ongoing support is provided by two NSF grants and one NIH grant.


Other Corpora

The Canterbury Corpus (http://corpus.canterbury.ac.nz/) is a set of files used for testing lossless compression algorithms. The corpus consists of 11 natural files, 4 artificial files, 3 large files, and a file with the first million digits of pi. You can also find a copy of the Calgary Corpus at the website, which was the de facto standard for testing lossless compression algorithms in the 1990s.
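Such corpora are used by compressing each file and comparing compressed size to original size. A minimal sketch of that measurement, using Python's zlib as a stand-in for whichever compressor is under test:

```python
import zlib

def compression_ratio(data: bytes, level: int = 9) -> float:
    """Compressed size divided by original size; lower means the
    algorithm did better on this input."""
    return len(zlib.compress(data, level)) / len(data)

# A corpus-style comparison: repetitive ("artificial") data compresses
# far better than more varied data, which is why a test corpus mixes
# natural, artificial, and large files rather than using one file type.
repetitive = b"abcd" * 10_000
varied = bytes(range(256)) * 40
```

Running `compression_ratio` over every corpus file and averaging (or reporting per-file results, as the Canterbury site does) gives a comparable benchmark across compressors.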