Indexing files with the NSRL Hash DB
Any way to exclude from the hashing process the "known" files from the NSRL Hash DB?
I don't understand the request.
Surely it is impossible to know if a file is in (or not in) a hash set before calculating the hash of the file?
It is only after the hash is calculated that it is possible to know if a file is in the NSRL list or not.
Sorry, I meant "indexing" process. Original post corrected.
Well, maybe not ... I don't see an edit link in the posts.
At the moment no.
If you were thinking of doing this to save time during indexing, by avoiding the indexing of some files, it won't have a huge impact.
For both indexing and hashing the file needs to be opened and read, beginning to end.
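To illustrate why there is no shortcut: computing a hash means streaming the entire file from disk, which is the same I/O cost indexing pays. A minimal sketch in Python (the NSRL RDS publishes MD5 and SHA-1 values; MD5 is shown here, and the chunk size is an arbitrary choice):

```python
import hashlib

def md5_of_file(path, chunk_size=1024 * 1024):
    """Hash a file by streaming it from beginning to end.

    Every byte must be read before the hash is known, so a file
    cannot be excluded from reading based on its NSRL membership --
    membership is only known after the read is done.
    """
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest().upper()
```

Reading in fixed-size chunks keeps memory use constant regardless of file size.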
The NSRL list contains mostly hashes for operating system files and application files. Last time I looked it was also fairly out of date.
For quicker indexing of a whole drive, consider skipping these folders:
c:\program files (x86)
You can enter a list of folders to skip in the Advanced options on the 2nd step of "Create index".
Of course you might miss something skipping these folders, but on the other hand you might examine twice as many drives in the time available.
I'm really thinking along the lines of overall case efficiency. Another way to go about it would be to list all files on the drive or partition being examined and run that list against the NSRL DB, then export the non-NSRL files to a separate folder and create the index from only the files in that folder.
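The workflow described above could be sketched roughly as follows. This is a simplified illustration, assuming `nsrl_hashes` is a set of uppercase MD5 strings loaded from the NSRL RDS beforehand (loading is not shown); it is not how any particular tool implements it:

```python
import hashlib
import shutil
from pathlib import Path

def copy_non_nsrl_files(source_dir, dest_dir, nsrl_hashes):
    """Hash every file, skip NSRL hits, copy the rest to a
    staging folder for indexing.

    Note that every file is fully read to hash it, and every
    non-hit is read again when copied -- the double I/O cost
    discussed in this thread.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = 0
    for path in Path(source_dir).rglob("*"):
        if not path.is_file():
            continue
        digest = hashlib.md5(path.read_bytes()).hexdigest().upper()
        if digest in nsrl_hashes:
            continue  # known file: exclude from the index
        # Caution: copying to a flat folder loses the directory
        # structure and can collide on duplicate file names.
        shutil.copy2(path, dest / path.name)
        copied += 1
    return copied
```

(`read_bytes()` loads whole files into memory; a real tool would stream large files in chunks.)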
That would not be efficient in general.
Let's take an example of a 100GB disk that is full of files. Assume around 40GB is system files and program application files (under c:\program files). Further assume the disk reads at around 28MB/sec (with mixed sequential and random access).
To hash all the files, you basically need to open and read every file on the hard drive. So at this point you have already spent 1 hour reading the disk (or maybe longer, as this excludes the computation time needed to calculate the hashes).
Assume you get a 15% hit rate from the NSRL across the 100GB of files, which I think is optimistic given how out of date the list is.
Then you need to copy the non-hits to another location. This involves another read of 85% of the disk, so this might be another 50min spent copying 85GB of data.
Then you need to do the indexing on the copied files. Which might be another 50min gone.
So using your method the total time is 2h40min.
Compare this to my suggestion of excluding system folders. Which would leave you with 60GB of document files to index. Which would only take 36min.
So my suggestion is 4 times more efficient (36min vs 2h40min). This is assuming you measure efficiency as the time required to index the most important files from a hard drive. Or consider using my method as a form of triage. If this quick indexing turns up something interesting, then you can go back and dig deeper via a full index (or creating a new index of just the missing folders).
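The figures above can be sanity-checked with a few lines of arithmetic, using the same assumptions (100GB disk, 28MB/sec, 15% NSRL hit rate, and disk speed as the only bottleneck):

```python
# Back-of-envelope comparison of the two approaches.
DISK_GB = 100
SPEED_MBS = 28  # assumed mixed sequential/random read speed

def minutes(gb):
    """Time in minutes to read `gb` gigabytes at SPEED_MBS MB/sec."""
    return gb * 1024 / SPEED_MBS / 60

hash_all = minutes(DISK_GB)              # read everything to hash: ~61 min
copy_non_hits = minutes(DISK_GB * 0.85)  # copy the 85GB of misses: ~52 min
index_copies = minutes(DISK_GB * 0.85)   # index the copied 85GB:   ~52 min
nsrl_total = hash_all + copy_non_hits + index_copies   # ~2h45m

skip_system = minutes(60)                # index 60GB of documents: ~37 min
```

The exact minutes differ slightly from the rounded figures in the thread, but the ratio comes out at roughly 4.5 to 1 in favour of skipping the system folders.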
The above is somewhat of a generalisation, as it assumes disk speed is the only limiting factor in both hashing and indexing. It also assumes the disk is near full. In reality indexing does take a bit longer, so that does add a bias for your suggestion. On the other hand, if you are also indexing unallocated disk space (which will never get an NSRL hit), that gives a bias to my suggestion. I think the outcome would be similar to the above if you took all factors into account.
Or you can just throw hardware at it. If you are working with disk images, then write the image to a fast SSD. A good SSD can be several times faster than a traditional HDD.