Difference between revisions of "Forensic corpora"
Revision as of 23:16, 5 February 2007
This page describes large-scale corpora of forensically interesting information that are available for those involved in forensic research.
The Garfinkel Used Hard drive Collection Project. Between 1998 and 2006, Garfinkel acquired 1250+ hard drives on the secondary market. These hard drive images have proven invaluable for a range of studies, such as developing new forensic techniques and examining the sanitization practices of computer users.
The DARPA Intrusion Detection Evaluation. In 1998, 1999 and 2000 the Information Systems Technology Group at MIT Lincoln Laboratory created a test network complete with simulated servers, clients, clerical workers, programmers, and system managers. Baseline traffic was collected. The systems on the network were then “attacked” by simulated hackers. Some of the attacks were well-known at the time, while others were developed for the purpose of the evaluation.
The Enron Corpus is a collection of email messages that were seized by the Federal Energy Regulatory Commission during its investigation of Enron.
CRAWDAD is a community archive for wireless data.
CALLFRIEND is a database of recorded English conversations. A total of 60 recorded conversations are available from the University of Pennsylvania at a cost of $600.
TalkBank is an online database of spoken language. The project was originally funded between 1999 and 2004 by two National Science Foundation grants; ongoing support is provided by two NSF grants and one NIH grant.
The Canterbury Corpus is a set of files used for testing lossless compression algorithms. The corpus consists of 11 natural files, 4 artificial files, 3 large files, and a file with the first million digits of pi. The website also hosts a copy of the Calgary Corpus, which was the de facto standard for testing lossless compression algorithms in the 1990s.