Difference between pages "Residual Data on Used Equipment" and "Forensic corpora"

From ForensicsWiki
m (See Also)
 
(Added and formatted pyflag info)
 
Line 1: Line 1:
Used hard drives are frequently a good source of images for testing forensic tools. That's because many individuals, companies and organizations neglect to properly sanitize their hard drives before they are sold on the secondary market.
+
This page describes large-scale corpora of forensically interesting information that are available for those involved in forensic research.
  
You can find used hard drives on eBay, at swap meets, yard sales, and even on the street.  
+
= Disk Images =
 +
 +
;''The Harvard/MIT Drive Image Corpus.'' Between 1998 and 2006, [[Simson Garfinkel|Garfinkel]] acquired 1250+ hard drives on the secondary market. These hard drive images have proven invaluable in a range of studies, such as developing new forensic techniques and examining the sanitization practices of computer users.
  
==ATMs==
+
* Garfinkel, S. and Shelat, A., [http://www.simson.net/clips/academic/2003.IEEE.DiskDriveForensics.pdf "Remembrance of Data Passed: A Study of Disk Sanitization Practices,"] IEEE Security and Privacy, January/February 2003.
* '''2009-11-21''': Robert Siciliano, a security consultant to Intelius.com and personal ID theft expert, buys an ATM machine for $750 from a bar in Boston. The machine comes with more than 1000 credit and ATM card numbers. http://www.theregister.co.uk/2009/11/18/second_hand_atm_fraud_risk/
+
==Hard Drives==
+
  
There have been several incidents in which individuals have purchased a large number of hard drives and written about what they found. This web page is an attempt to catalog all of those stories in chronological order.
+
;''The Honeynet Project Forensic Challenge.'' In 2001 the Honeynet project distributed a set of disk images and asked participants to conduct a forensic analysis of a compromised computer. Entries were judged and posted for all to see. The drive and writeups are still available online.
  
* '''2003-01''': [[Simson Garfinkel]] and Abhi Shelat at MIT publish a study in ''IEEE Security and Privacy Magazine'' which documents the large amount of personal and business-sensitive information found on 150 drives purchased on the secondary market.
+
* [http://www.honeynet.org/challenge/index.html The Honeynet Project's Forensic Challenge], January 2001.
  
* '''2006-06''': A man buys a family's hard drive at a flea market in Chicago after the family's hard drive is upgraded by Best Buy. Apparently somebody at Best Buy violated company policy and sold the hard drive instead of destroying it.
+
;''The [http://www.cfreds.nist.gov/ Computer Forensic Reference Data Sets]'' project from [[National Institute of Standards and Technology|NIST]] hosts a few sample cases that may be useful for examiners to practice with.
  
* '''2006-08-10''': The University of Glamorgan in Wales purchased 317 used hard drives from the UK, Australia, Germany, and the US. 25% of the 200 drives purchased from the UK market had been completely wiped. 40% of the purchased drives didn't work. 40% came from businesses, of which 23% contained enough information to identify the company. 5% had business sensitive information. 25% came from individuals, of which many had pornography, and 2 had to be referred to the police for suspected child pornography.
+
* http://www.cfreds.nist.gov/Hacking_Case.html
  
* '''2006-08-14''': [http://news.bbc.co.uk/2/hi/business/4790293.stm BBC News] reports on bank account information recovered from used PC hard drives and being sold in Nigeria for £20 each. The PCs had apparently come from recycling points run by UK town councils that are then "recycled" by being sent to Africa.
+
;''The PyFlag standard test image set''
 +
* http://pyflag.sourceforge.net/Documentation/tutorials/howtos/test_image.html
  
* '''2006-08-15''': Simson Garfinkel presents results of a study of 1000 hard drives (750 working) at the 2006 Workshop on Digital Forensics. Results of the study show that information can be correlated across hard drives using Garfinkel's [[Cross Drive Analysis]] approach.
+
= Network Packets and Traces =
  
* '''2007-02-06''': [http://www.fulcruminquiry.com Fulcrum Inquiry], a Los Angeles litigation support firm, purchased 70 used hard drives from 14 firms and discovered confidential information on 2/3rds of the drives.
+
== DARPA ID Eval ==
  
* '''2007-08-30''': Bill Ries-Knight, an IT consultant, purchases a 120GB Seagate hard drive on eBay for $69. Although the drive was advertised as being new, it apparently was previously used by the campaign of Mike Beebe, who won the Arkansas state governorship in November 2006. "Among the files were documents listing the private cell phone numbers of political allies, including US Senators Blanch Lincoln and Mark Pryor and US Representatives Marion Berry, Mike Ross and Vic Snyder. It also included talking points to guide the candidate as he called influential people whose support he sought," states an article published in [http://www.theregister.co.uk/2007/08/30/governors_data_sold_on_ebay/ The Register].
+
''The DARPA Intrusion Detection Evaluation.'' In 1998, 1999 and 2000 the Information Systems Technology Group at MIT Lincoln Laboratory created a test network complete with simulated servers, clients, clerical workers, programmers, and system managers. Baseline traffic was collected. The systems on the network were then “attacked” by simulated hackers. Some of the attacks were well-known at the time, while others were developed for the purpose of the evaluation.
  
* '''2008-01-28''': Gregory Evans, a security consultant in Marina del Rey, Calif., bought a $500 computer at a swap meet from a former mortgage company. It contained credit reports on 300 people in a deleted file, according to an article published in [http://www.nydailynews.com/money/2008/01/28/2008-01-28_sensitive_info_lives_on_in_old_computers.html The New York Daily News]. The security consultant was also able to recover the usernames and passwords of the mortgage company's former employees.
+
* [http://www.ll.mit.edu/IST/ideval/data/1998/1998_data_index.html 1998 DARPA Intrusion Detection Evaluation]
 +
* [http://www.ll.mit.edu/IST/ideval/data/1999/1999_data_index.html 1999 DARPA Intrusion Detection Evaluation]
 +
* [http://www.ll.mit.edu/IST/ideval/data/2000/2000_data_index.html 2000 DARPA Intrusion Detection Scenario Specific]
  
*'''2009-02-10''': Michael Kessler, CEO of Kessler International, a New York City forensics firm, bought 100 "relatively modern drives, the vast majority of them Serial ATA" from eBay over the course of 6 months. The drives ranged in size from 400GB to 300GB. 40% of the drives were found to contain sensitive data. [http://www.computerworld.com/action/article.do?command=viewArticleBasic&taxonomyName=storage&articleId=9127717&taxonomyId=19&intsrc=kc_top]
+
== WIDE ==
 +
''The [http://www.wide.ad.jp/project/wg/mawi.html MAWI Working Group] of the [http://www.wide.ad.jp/ WIDE Project]'' maintains a [http://tracer.csl.sony.co.jp/mawi/ Traffic Archive]. In it you will find:
 +
* daily trace of a trans-Pacific T1 line;
 +
* daily trace at an IPv6 line connected to 6Bone;
 +
* daily trace at another trans-Pacific line (100Mbps link) in operation since 2006/07/01.
  
*'''2009-05-07''': The University of Glamorgan bought disks in its annual survey of used hard drives and found "Details of test launch procedures for the THAAD (Terminal High Altitude Area Defence) ground-to-air missile defence system." [http://news.bbc.co.uk/2/hi/uk_news/wales/8036324.stm Missile data found on hard drives, BBC News, May 7, 2009]
+
Traffic traces are captured with tcpdump, and the IP addresses in the traces are then scrambled by a modified version of [[tcpdpriv]].
  
*'''2009-07-30''': Reporters working for the PBS show Frontline on an article about electronic waste find hard drives in Ghana that contain "hundreds and hundreds of documents about government contracts" from a hard drive that had been previously used by a TSA subcontractor. The documents were marked "competitive sensitive" and covered contracts with the Defense Intelligence Agency. The hard drive was not encrypted.  [http://itworld.com/security/69758/reporters-find-northrop-grumman-data-ghana-market Reporters find Northrop Grumman data in Ghana market, Robert McMillan, IT World, June 24, 2009]
+
==Wireshark==
 +
The open source Wireshark project (formerly known as Ethereal) has a website with many network packet captures:
 +
* http://wiki.wireshark.org/SampleCaptures
  
*'''2009-09-23''': US DoD sells computers without cleaning them first. http://fcw.com/articles/2009/09/23/inspector-general-audit.aspx
+
==NFS Packets==
 +
The Storage Networking Industry Association has a set of network file system traces that can be downloaded from:
 +
* http://iotta.snia.org/traces
 +
* http://tesla.hpl.hp.com/public_software/
  
==Cell Phones==
+
=Text Files=
* [http://www.wired.com/techbiz/media/news/2003/08/60052 BlackBerry Reveals Bank's Secrets], Wired, August 8, 2005.
+
==Email messages==
* [http://www.taipeitimes.com/News/feat/archives/2008/09/28/2003424400 Who has your old phone's data], Pete Warren, The Guardian, London, Sept. 28, 2008, page 13.
+
* [http://www.myfoxdc.com/myfox/pages/News/Detail?contentId=8055902&version=1&locale=EN-US&layoutCode=TSTY&pageId=3.2.1 McCain Campaign Sells Info-Loaded Blackberry to FOX 5 Reporter], by Tisha Thompson and Rick Yarborough, FOX 5 Investigative Unit, 11 December 2008.  (See also [http://www.theregister.co.uk/2008/12/12/mccain_blackberry/])
+
  
==Cameras==
+
''The Enron Corpus'' is a collection of email messages that were seized by the Federal Energy Regulatory Commission during its investigation of Enron.
* [http://www.telegraph.co.uk/news/uknews/3107003/Camera-sold-on-eBay-contained-MI6-files.html Camera sold on eBay contained MI6 files], Jessica Salter, Telegraph, September 30, 2008.
+
  
==Network Equipment==
+
* http://www.cs.cmu.edu/~enron
* [http://www.pcpro.co.uk/news/227190/council-sells-security-hole-on-ebay.html Council sells security hole on Ebay], Matthew Sparkes, PC Pro, September 29, 2008 - Kirklees Council (UK) sells a Cisco [[VPN]] 3002 Concentrator on eBay for 99 pence. The device is purchased by Andrew Mason, a security consultant, who discovers that it still holds the full configuration for Kirklees Council and has not been deactivated.
+
* http://www.enronemail.com/
  
==MP3 Players==
+
==Log files==
* [http://news.yahoo.com/s/ap/20090127/ap_on_re_as/as_new_zealand_us_military_files NZ man's MP3 player holds US military files], Associated Press, Jan 27, 2009. A man from New Zealand bought an MP3 player at a thrift shop in Oklahoma that had 60 US military files, "including names and telephone numbers for American soldiers."
+
[http://crawdad.cs.dartmouth.edu/index.php CRAWDAD] is a community archive for wireless data.
  
==See Also==
+
[http://www.caida.org/data/ CAIDA] collects a wide variety of data.
[[Residual Data]]
+
 
[[Residual Data in Document Files]]
+
[http://www.dshield.org/howto.html DShield] asks users to submit firewall logs.
 +
 
 +
==Text for Text Retrieval==
 +
The [http://trec.nist.gov Text REtrieval Conference (TREC)] has made available a series of [http://trec.nist.gov/data.html text collections].
 +
 
 +
==American National Corpus==
 +
The [http://www.americannationalcorpus.org/ American National Corpus (ANC) project] is creating a massive collection of American English from 1990 onward. The goal is to create a corpus of at least 100 million words that is comparable to the British National Corpus.
 +
 
 +
==British National Corpus==
 +
The [http://www.natcorp.ox.ac.uk/ British National Corpus (BNC)] is a 100-million-word collection of written and spoken English from a variety of sources.
 +
 
 +
=Voice=
 +
==CALLFRIEND==
 +
CALLFRIEND is a database of recorded English conversations. A total of 60 recorded conversations are available from the University of Pennsylvania at a cost of $600.
 +
 
 +
==TalkBank==
 +
TalkBank is an online database of spoken language. The project was originally funded between 1999 and 2004 by two National Science Foundation grants; ongoing support is provided by two NSF grants and one NIH grant.
 +
 
 +
==Augmented Multi-Party Interaction Corpus==
 +
The [http://corpus.amiproject.org/ AMI Meeting Corpus] has 100 hours of meeting recordings.
 +
 
 +
==Other Corpora==
 +
The [http://corpus.canterbury.ac.nz/ Canterbury Corpus] is a set of files used for testing lossless compression algorithms. The corpus consists of 11 natural files, 4 artificial files, 3 large files, and a file with the first million digits of pi. The website also hosts a copy of the Calgary Corpus, which was the de facto standard for testing lossless compression algorithms in the 1990s.
 +
 
 +
The [http://traces.cs.umass.edu/index.php/Main/HomePage UMass Trace Repository] provides network, storage, and other traces to the research community for analysis. The UMass Trace Repository is supported by grant #CNS-323597 from the National Science Foundation.

Revision as of 18:46, 12 July 2008

This page describes large-scale corpora of forensically interesting information that are available for those involved in forensic research.

Disk Images

The Harvard/MIT Drive Image Corpus. Between 1998 and 2006, Garfinkel acquired 1250+ hard drives on the secondary market. These hard drive images have proven invaluable in a range of studies, such as developing new forensic techniques and examining the sanitization practices of computer users.
The Honeynet Project Forensic Challenge. In 2001 the Honeynet project distributed a set of disk images and asked participants to conduct a forensic analysis of a compromised computer. Entries were judged and posted for all to see. The drive and writeups are still available online.
The Computer Forensic Reference Data Sets project from NIST hosts a few sample cases that may be useful for examiners to practice with.
The PyFlag standard test image set

Network Packets and Traces

DARPA ID Eval

The DARPA Intrusion Detection Evaluation. In 1998, 1999 and 2000 the Information Systems Technology Group at MIT Lincoln Laboratory created a test network complete with simulated servers, clients, clerical workers, programmers, and system managers. Baseline traffic was collected. The systems on the network were then “attacked” by simulated hackers. Some of the attacks were well-known at the time, while others were developed for the purpose of the evaluation.

WIDE

The MAWI Working Group of the WIDE Project maintains a Traffic Archive. In it you will find:

  • daily trace of a trans-Pacific T1 line;
  • daily trace at an IPv6 line connected to 6Bone;
  • daily trace at another trans-Pacific line (100Mbps link) in operation since 2006/07/01.

Traffic traces are captured with tcpdump, and the IP addresses in the traces are then scrambled by a modified version of tcpdpriv.
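
The idea of scrambling addresses before a trace is published can be illustrated with a minimal Python sketch. This is not tcpdpriv's algorithm, and it is not how the MAWI archive is actually processed; the function name scramble_ipv4 and the keyed-hash approach are assumptions chosen for illustration. Each IPv4 address is replaced by a pseudonym derived from a secret key, so the same address always maps to the same pseudonym within a trace. Unlike some anonymization schemes, this mapping does not preserve network prefixes, and two distinct addresses could in rare cases collide.

```python
import hashlib
import hmac
import ipaddress


def scramble_ipv4(addr: str, key: bytes) -> str:
    """Return a pseudonymous IPv4 address for addr (hypothetical helper).

    The mapping is deterministic for a given key, so flows between the
    same two hosts remain recognizable after scrambling.
    """
    packed = ipaddress.IPv4Address(addr).packed         # 4 raw address bytes
    digest = hmac.new(key, packed, hashlib.sha256).digest()
    return str(ipaddress.IPv4Address(digest[:4]))       # first 4 digest bytes


if __name__ == "__main__":
    secret = b"example-key-kept-private"                # assumed secret key
    for ip in ("192.0.2.1", "192.0.2.1", "198.51.100.7"):
        print(ip, "->", scramble_ipv4(ip, secret))
```

Whoever holds the key can recompute the mapping, so in this sketch the key would need to be discarded or kept private once the trace is released.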

Wireshark

The open source Wireshark project (formerly known as Ethereal) has a website with many network packet captures:

NFS Packets

The Storage Networking Industry Association has a set of network file system traces that can be downloaded from:

Text Files

Email messages

The Enron Corpus is a collection of email messages that were seized by the Federal Energy Regulatory Commission during its investigation of Enron.

Log files

CRAWDAD is a community archive for wireless data.

CAIDA collects a wide variety of data.

DShield asks users to submit firewall logs.

Text for Text Retrieval

The Text REtrieval Conference (TREC) has made available a series of text collections.

American National Corpus

The American National Corpus (ANC) project is creating a massive collection of American English from 1990 onward. The goal is to create a corpus of at least 100 million words that is comparable to the British National Corpus.

British National Corpus

The British National Corpus (BNC) is a 100-million-word collection of written and spoken English from a variety of sources.

Voice

CALLFRIEND

CALLFRIEND is a database of recorded English conversations. A total of 60 recorded conversations are available from the University of Pennsylvania at a cost of $600.

TalkBank

TalkBank is an online database of spoken language. The project was originally funded between 1999 and 2004 by two National Science Foundation grants; ongoing support is provided by two NSF grants and one NIH grant.

Augmented Multi-Party Interaction Corpus

The AMI Meeting Corpus has 100 hours of meeting recordings.

Other Corpora

The Canterbury Corpus is a set of files used for testing lossless compression algorithms. The corpus consists of 11 natural files, 4 artificial files, 3 large files, and a file with the first million digits of pi. The website also hosts a copy of the Calgary Corpus, which was the de facto standard for testing lossless compression algorithms in the 1990s.
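
As a rough illustration of how such a corpus is typically used, the Python sketch below measures the compression ratio that zlib achieves on a block of data and checks that the round trip is lossless. The function name compression_ratio and the sample input are assumptions made for this example; a real benchmark would read each corpus file from a local copy and compare several compressors.

```python
import zlib


def compression_ratio(data: bytes, level: int = 6) -> float:
    """Return compressed size divided by original size (hypothetical helper)."""
    if not data:
        raise ValueError("cannot compute a ratio for empty input")
    compressed = zlib.compress(data, level)
    # A lossless compressor must reproduce the input exactly.
    if zlib.decompress(compressed) != data:
        raise RuntimeError("round trip did not reproduce the input")
    return len(compressed) / len(data)


if __name__ == "__main__":
    sample = b"forensic corpora " * 1000    # stand-in for a corpus file
    print(f"ratio: {compression_ratio(sample):.4f}")
```

A lower ratio means better compression; running the same function over every file in a corpus gives a per-file profile that can be compared across algorithms.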

The UMass Trace Repository provides network, storage, and other traces to the research community for analysis. The UMass Trace Repository is supported by grant #CNS-323597 from the National Science Foundation.