Creating Custom Userlists from Document Metadata

In the past on the podcast we’ve talked about a number of tools for document metadata gathering and how we can use them for gathering good information.
I’ve talked about EXIFtool for examining and deleting metadata from JPEGs. This was helpful for some info, but only on images.
I’ve covered Metagoofil, which we use to download all sorts of common data and word processing documents and analyze them for interesting information. Unfortunately, Metagoofil will only download documents from the web and then process them. We have no ability to process documents from our own store on disk.
By accident I discovered that we can get much of the same information by using EXIFtool not on JPEGs, but on Word, Excel and PowerPoint documents! EXIFtool has the ability to parse metadata as defined by the FlashPix standard, which was introduced in 1996 and developed by Kodak, Hewlett-Packard and Microsoft. Microsoft still uses the format for storing data in its documents. We can use EXIFtool to gather usernames from the documents.
Note: This will only work on Office documents that were not created with Office 2007 (.docx), as the new version relies on a different metadata storage format. I’ll have a solution for this one soon!
We can start down and dirty with getting the information on Office documents. In the directory that contains our supported Office documents, we can execute the following command:

$ exiftool -r -h -a -u -g1 * >output.html

This will execute EXIFtool to extract all EXIF metadata recursively in the current directory (-r), with all output including duplicates (-a) and unknown tags (-u), organizing by EXIF tag category (-g1), for all files, with HTML friendly formatting (-h), into a file named output.html in the current directory (>output.html). With this we get a handy little HTML report!
But we may want just the info on usernames/authors. We can trim the output down to just the appropriate data elements:

$ exiftool -r -a -u -Author -LastSavedBy * >users.txt

We’ve removed the HTML and sorting options, as they would only make any additional processing difficult. I’ve also only grabbed the Author and LastSavedBy tags, as these are the most common places for usernames. Now we can take our users.txt and remove all of the extra information with some Unix text processing:

$ strings users.txt | cut -d":" -f2 | grep -v "=" | grep -v "image files read" | tr '[:space:]' '\n' | sort | uniq >cleanusers.txt

Now all we are left with is a list of potential usernames, one per line. We’ve dropped all of the extra text up to the first delimiter (:), dropped the lines that contain “=” or “image files read”, converted spaces to newlines, sorted alphabetically and removed the duplicates. This will introduce some need for manual culling, as sometimes the author is listed as “Firstname Lastname”, and each name gets kept individually. However, in some smaller companies just a first or last name is perfectly acceptable as a username, so you may not want to cull your list at all.
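To see the cleanup steps in action without running EXIFtool, here is a minimal sketch that feeds a hand-written sample through the same pipeline. The sample file, the names in it, and the function names clean_users and clean_names are all made up for illustration, and the sample only approximates the layout of real EXIFtool output. I’ve left out strings because the sample is already plain text, and added a filter for the blank lines that tr leaves behind. The second function is one possible way to deal with the “Firstname Lastname” problem: it keeps each value whole instead of splitting on spaces.

```shell
# Hypothetical sample standing in for users.txt; the names are invented.
cat > sample_users.txt <<'EOF'
======== ./report.doc
Author                          : John Smith
Last Saved By                   : tjones
======== ./budget.xls
Author                          : Alice
Last Saved By                   : John Smith
    2 image files read
EOF

# Same steps as the one-liner: keep the text after the first ":",
# drop the "========" file headers and the summary line, put each
# word on its own line, drop blank lines, then sort and de-duplicate.
clean_users() {
    cut -d":" -f2 "$1" \
        | grep -v "=" \
        | grep -v "image files read" \
        | tr '[:space:]' '\n' \
        | grep -v '^$' \
        | sort | uniq
}

# Variant (an assumption, not part of the original one-liner): strip the
# leading spaces but keep "Firstname Lastname" together on one line.
clean_names() {
    cut -d":" -f2 "$1" \
        | grep -v "=" \
        | grep -v "image files read" \
        | sed 's/^ *//' \
        | grep -v '^$' \
        | sort | uniq
}
```

With this sample, clean_users sample_users.txt should give Alice, John, Smith and tjones on separate lines, while clean_names sample_users.txt keeps John Smith as a single entry.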
Now, we are left with a list of potential usernames that we can utilize for password brute force attempts for other services, such as VPNs or web based applications.