Similarity searches accelerate P2P downloads by 30-70%

I'm sure you've all searched for a file in a P2P client, only to see what is clearly the same file displayed over and over in the results. The file name or description may differ, but you can tell it's the same file from the size or some other detail. Now imagine being able to download your file by connecting to everyone who has (more or less) the same file.

The limitation of P2P file sharing is, of course, the number of active users who have the file in question. A research team with members at Carnegie Mellon, Purdue, and Intel claims it can increase P2P transfer speeds by 30-70% using what it calls Similarity-Enhanced Transfer (SET), which exploits the fact that many of the files being shared contain pieces of identical data. Once the requested file is divided into small chunks, the SET software searches for similar files using a method called "handprinting" (similar to techniques used to cluster search results or filter spam). Once similar files are identified, they are scanned for individual chunks that are identical to pieces of the file being downloaded. As a result, SET should greatly expand the available sources for any given file. The beauty of the system is that it's not all theory; the team has successfully tested the technology with existing P2P networks.
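The article doesn't spell out SET's exact chunking or sampling parameters, but the general handprinting idea can be sketched in a few lines of Python: split a file into fixed-size chunks, hash each one, and publish a small deterministic sample of those hashes, so that files sharing many chunks are very likely to share a handprint entry in a global lookup table. The chunk size, sample size, hash choice, and file name below are illustrative assumptions, not values taken from the paper.

```python
import hashlib

CHUNK_SIZE = 16 * 1024   # 16 KB chunks (assumed; matches the figure quoted in the comments)
HANDPRINT_SIZE = 8       # number of sampled chunk hashes; illustrative only

def chunk_hashes(path):
    """Hash each fixed-size chunk of a file (SHA-1 as a stand-in hash)."""
    hashes = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            hashes.append(hashlib.sha1(chunk).hexdigest())
    return hashes

def handprint(hashes, k=HANDPRINT_SIZE):
    """Deterministic sample: the k lexicographically smallest chunk hashes.
    Files that share many chunks are very likely to share at least one
    handprint entry, so a lookup on any entry surfaces 'similar' files."""
    return sorted(hashes)[:k]

# Hypothetical usage ("some_video.avi" is a placeholder path):
#   print(handprint(chunk_hashes("some_video.avi")))
```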

The team will present the technique at the 4th Symposium on Networked Systems Design and Implementation tomorrow, along with actual implementation code. "This is a technique that I would like people to steal. Developers should just take the idea and use it in their own systems," said David Andersen of Carnegie Mellon.

News source: Ars Technica

33 Comments

The reason this has not been implemented is the massive overhead of searching for who has a similar chunk. I do not see any P2P software implementing this any time soon.

In addition, this similarity search mostly benefits music files with changed ID3 tags, so how much can you really accelerate the download of a 3-10 MB MP3?

Well, it's kind of based on BitTorrent. The only difference is what can act as a source. Say there are two videos, initially from the same source, and one guy changes the metadata (actors, title, etc.). That changes the hash, so the system sees them as two different files, and they can no longer act as compatible sources for each other.

This system is smart enough to recognize that everything but the header is the same, so it can use the modified file as a source for the download (sans the header).

As a generalization, if two files have any number of identical 16 KB chunks, each file can serve as a source for those chunks, even if the files as a whole are entirely different.

This system would greatly increase the number of sources, which in turn would increase speed. Note that if you are already downloading at close to peak speed, it won't help. And if there is already a proliferation of sources, it won't be much help either.
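To put the point above in concrete terms, here is a toy Python sketch (hypothetical data; the fixed 16 KB chunk size and SHA-1 hash are assumptions for illustration, not necessarily what SET uses): two files that differ only in an appended tag block still share almost all of their chunk hashes, so a peer holding one can serve most of the other.

```python
import hashlib, os

CHUNK = 16 * 1024  # 16 KB chunks, as in the comment above

def chunk_hashes(data):
    """SHA-1 of each fixed-size chunk (a stand-in for whatever hash SET uses)."""
    return [hashlib.sha1(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)]

# Hypothetical example: the same ~1 MB payload with two different tag blocks
# appended at the end (an in-place metadata edit would behave the same way).
payload = os.urandom(1024 * 1024)
file_a = payload + b"title=Original Rip\x00"
file_b = payload + b"title=Renamed Copy\x00"

shared = set(chunk_hashes(file_a)) & set(chunk_hashes(file_b))
print(f"{len(shared)} of {len(chunk_hashes(file_a))} chunks are identical")
# -> 64 of 65: a peer holding file_b can serve almost every chunk of file_a.
```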

soumyasch said,
Well, it's kind of based on BitTorrent. The only difference is what can act as a source. Say there are two videos, initially from the same source, and one guy changes the metadata (actors, title, etc.). That changes the hash, so the system sees them as two different files, and they can no longer act as compatible sources for each other.

No, it's based on faulty principles that the P2P community abandoned for a multitude of reasons.

BitTorrent splits the file(s) into pieces and hashes each one (the DHT is a separate mechanism for finding peers, not a hashing method). If any part of the torrent changes, the changed pieces are marked as corrupt if your client notices; if it doesn't, you send them out anyway, but the receiver notices the hash mismatch and discards them.

Oooo, me likes. It appears we're not only increasing connection speeds between our home computers, but with the whole world.

Lt-DavidW said,
BitTorrent uses identically referenced files already, so no, this will make no difference for BitTorrent.
Not really. BitTorrent uses "parts" of a single file. SET seems to use the same patterns in ANY file, so if you've got the same pattern in file.exe and file.doc, you download that pattern no matter which file it comes from.

SHADOW-XIII said,
Not really. BitTorrent uses "parts" of a single file. SET seems to use the same patterns in ANY file, so if you've got the same pattern in file.exe and file.doc, you download that pattern no matter which file it comes from.

Hmm.. on-the-fly compression?

SHADOW-XIII said,
Not really. BitTorrent uses "parts" of a single file. SET seems to use the same patterns in ANY file, so if you've got the same pattern in file.exe and file.doc, you download that pattern no matter which file it comes from.

And with that you greatly increase the chance of hash collisions causing corruption.

Sounds like all they've done is create a way for P2P clients on searchable networks to get valid pieces of a file from partially corrupt sources (i.e. someone downloaded the file over a bad connection and a few bytes got messed up). If that's really all this does, then they're almost a decade behind: tiger tree hashing already solved the core problem by preventing the corruption in the first place, redownloading a chunk of the file whenever it doesn't match its hash.
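Python's hashlib has no Tiger implementation, so the sketch below substitutes SHA-1, but the verify-and-redownload loop the comment describes looks much the same whatever piece hash is used; the flaky peer here is simulated rather than a real network fetch.

```python
import hashlib, os, random

PIECE = 16 * 1024

def verified_download(piece_hashes, fetch, max_attempts=10):
    """Fetch each piece and re-request it until its hash matches, discarding
    corrupt copies -- the behaviour the comment above attributes to tree hashing."""
    pieces = []
    for i, expected in enumerate(piece_hashes):
        for _ in range(max_attempts):
            data = fetch(i)
            if hashlib.sha1(data).digest() == expected:  # real TTH clients use Tiger
                pieces.append(data)
                break
        else:
            raise IOError(f"piece {i} failed verification {max_attempts} times")
    return b"".join(pieces)

# Simulated usage: a flaky peer that zeroes out roughly 20% of its responses.
original = os.urandom(4 * PIECE)
hashes = [hashlib.sha1(original[i:i + PIECE]).digest()
          for i in range(0, len(original), PIECE)]

def flaky_fetch(i):
    piece = original[i * PIECE:(i + 1) * PIECE]
    return piece if random.random() > 0.2 else b"\x00" * len(piece)

assert verified_download(hashes, flaky_fetch) == original
```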

It's a little more elaborate than that. It scans for identical data and simply doesn't download it again; it duplicates what it already has.

So if a file is 1 GB, but 300 MB of that 1 GB is identical chunks of data, it doesn't re-download those chunks; it simply duplicates what it already has. At least, that's what it sounds like.

Not exactly. What this does is search through other files available on the network to see if any of them have data chunks that are identical to chunks of the file you are trying to download. For instance, say you are downloading a Linux distribution with a few hundred extra MB of software included, such as word processors and the like, and that file only has a few sources. Now, let's say the same distro, minus the extra apps, is available on the same P2P network. Normally, that wouldn't affect your download at all. With this idea, your download could source the chunks it shares with the vanilla distro, speeding along the download of your more expanded distro.

Nose Nuggets said,
It's a little more elaborate than that. It scans for identical data and simply doesn't download it again; it duplicates what it already has.

So if a file is 1 GB, but 300 MB of that 1 GB is identical chunks of data, it doesn't re-download those chunks; it simply duplicates what it already has. At least, that's what it sounds like.


To do that, it does what was thought of long ago and quickly discarded as technically infeasible. For a file to have chunks that match at the same offsets, it has to be the same file that is either corrupt or has had metadata (like ID3 tags) changed. Some P2P protocols simply strip out the metadata and send it separately from the media portion of the file; others treat it as a corrupt file but still use a tree-hashing method to send the chunks that are common. Both methods solve the problem better than SET does, did it about a decade earlier, and don't carry SET's huge overhead of hashing every 16 KB and sending requests to the network for such tiny chunks. This method also eventually runs into collisions, where two different chunks have the same hash; so if they tried to allow something like downloading the similar portions of different Linux distros that vary in size, you could easily end up with a different chunk that has the same hash as the one you're trying to get.
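How real the collision risk is depends mostly on the hash width. A quick birthday-bound estimate (a back-of-the-envelope calculation, not a figure from the article, and the chunk count below is a made-up assumption) suggests accidental collisions are negligible with a full-length cryptographic hash but effectively guaranteed with a short one:

```python
# Birthday bound: P(any two of n chunks share a hash) ~= n^2 / 2^(b+1) for a b-bit hash.
def collision_probability(num_chunks: int, hash_bits: int) -> float:
    return num_chunks ** 2 / 2 ** (hash_bits + 1)

chunks_indexed = 10 ** 12   # hypothetical: a trillion 16 KB chunks known to the network
print(collision_probability(chunks_indexed, 160))  # 160-bit hash (SHA-1 sized): ~3e-25
print(collision_probability(chunks_indexed, 64))   # 64-bit hash: >> 1, collisions certain
```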

Impressive, that could be very useful! It wouldn't work for torrents though, would it? Also, some software already does something similar to this, in that even if the file name is different, it recognises it's the same file and downloads from both.

That's cool. Data Domain has a backup product that works on a similar principle: instead of backing up every bit, it only backs up data that differs, essentially allowing a 1 TB volume to hold 200-300 TB of usable data (after compression).

I'll be watching this for sure.
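For what it's worth, the deduplicating backup described above rests on the same content-addressing principle as SET's chunk matching. A minimal toy sketch (hypothetical in-memory store with SHA-1 keys and 16 KB chunks; nothing to do with Data Domain's actual, proprietary format):

```python
import hashlib

CHUNK = 16 * 1024

class DedupStore:
    """Toy content-addressed store: an identical chunk is kept only once,
    however many backups reference it."""
    def __init__(self):
        self.chunks = {}    # chunk hash -> chunk bytes
        self.backups = {}   # backup name -> ordered list of chunk hashes

    def backup(self, name, data):
        refs = []
        for i in range(0, len(data), CHUNK):
            piece = data[i:i + CHUNK]
            key = hashlib.sha1(piece).hexdigest()
            self.chunks.setdefault(key, piece)  # stored once, no matter how often seen
            refs.append(key)
        self.backups[name] = refs

    def restore(self, name):
        return b"".join(self.chunks[h] for h in self.backups[name])

# Usage: two near-identical 1 MB volumes share all but one chunk in the store.
store = DedupStore()
volume = bytes(1024 * 1024)
store.backup("monday", volume)
store.backup("tuesday", volume + b"new log line")
print(len(store.chunks), "unique chunks stored")  # far fewer than two full copies
assert store.restore("monday") == volume
```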

Ideas Man said,
So does the image format for Windows Vista and the backup solution included in Windows Home Server ;)

No, that would be impossible. Data Domain's backup format is proprietary. But it would have to be; there is no way around it.

It would be impossible to get a stable percentage because, as you can see, the technology is completely dependent on the files other people have uploaded. The more copies of the same file there are, the better the speeds, meaning a higher percentage. So, obviously, if you're looking for a movie called "Frankenstein Does a Chicken in France on April 32nd", you probably won't find many other people with the same file, which means a lower percentage.