Saturday, March 10, 2012

Using Screenshots to Examine Many Files Quickly

I had a couple of projects that called for a way to examine a large number of files.  It seemed that screenshots could help in those projects.  This post describes the techniques I used.

EML Analysis

In one project, I was working with various email files that I had exported from Thunderbird.  These files had an EML extension.  Typically, if I viewed an EML file in Notepad, I would see various codes and other information that wouldn't be visible if I viewed it in an email program like Thunderbird.

I was interested in seeing the header codes in these EML files.  Those codes appeared at the tops of the files.  I felt that I could probably see what I needed to see in the first screenful of a Notepad session, opened maximized.

In other words, the concept was that I would open the EML file in Notepad; I would take a screenshot; I would save the screenshot; and then I would close the file and repeat the process with the next EML file on my list.  Then I would combine all those screenshots into one file, and flip through it or perhaps use other tools to analyze it further.  I wouldn't have to sit there, maintaining constant attention while the process continued in real time; I could just review the outcome afterwards.  (For some purposes, an alternative would have been to combine or select from the text, without a graphical view.)

The first step was to build the list of EML files that I wanted to examine.  I moved them all into a single folder and used DIR and Excel to give me the list and to convert it into a series of batch commands.  There was one such command for each such EML file.  Before running those commands, I had to open Notepad once, turn on its Format > Word Wrap option, and then close it.  The format of the command was as follows:

start /max notepad "D:\Folder Name\Email Name.eml"
That command was sufficient to open the EML file.  Next, I needed to pause the system for a moment, so that the file would have time to come onscreen. Among numerous suggestions , I favored a command involving PING ("ping 1.1.1.1 -n 1 -w 1500 > nul") because of its fine-tunable setting (in the example just given, 1500 milliseconds).  Unfortunately, that command's output component (" > nul") would have prevented me from adding more commands on the same line.  So I had to go with "TIMEOUT /T 1" for a one-second delay.

Next, I needed a command to take a snapshot.  It looked like there were multiple options here.  I had already installed NirCmd and had found it useful for other things, so I used this command:
start NirCmd savescreenshot "D:\Folder Name\Screenshots\Email Name.png"
NirCmd came with an option to copy its executable (nircmd.exe) to C:\Windows, so that this command could run without any need to specify the location of NirCmd, to put a copy of it in the current working folder, or to modify the computer's Path.  NirCmd wasn't saving to subfolders properly, so in the end I had to modify that part of the command.

Finally, I needed a command to close Notepad. The advice that worked for me was:
taskkill /f /im notepad.exe
Note that this would close all currently open Notepad sessions.  These three steps (i.e., open the EML in Notepad, take a picture with NirCmd, close Notepad) would give me a screenshot of the first screenful's worth of the file's contents.  Collectively, those screenshots would give me a visual impression of the various kinds of codes appearing at the start of my EML files.

I used && to combine multiple commands on the same line, as a single (long) batch command. If that had failed, I could have added index columns next to the spreadsheet columns in which I built those two commands, with alternating even and odd numbers in those columns:  1 for the first Notepad command, 2 for the first NirCmd command, 3 for the second Notpad command, and so forth.  These index numbers would allow the various commands to be sorted into proper sequence in a single column, for copying and pasting into a batch file.

In short, for each EML file, I combined four commands with &&, into a single long command like this:
start /max notepad "D:\Folder Name\Email Name.eml" && timeout /t 1 && start NirCmd savescreenshot "Email Name.png" && taskkill /f /im notepad.exe
This gave me some PNGs.  Now there was the question of what to do with them.  One option was to simply stitch them together in a slideshow (using e.g., IrfanView) or a single PDF (using e.g., Acrobat).  I did a brief investigation of OCR software for that purpose.  Ultimately, I just used IrfanView, without even creating a slideshow, to arrow down through those PNGs, one at a time, at whatever pace I chose.  So I could look at whether each page came through OK.

PDF Analysis

In another project, I had a bunch of PDFs that I had created in a conversion process.  I wanted to check if the PDFs came through OK.  It would have been very slow to open them, one at a time, and page through them.  Combining them all into a single large PDF, which I could also page through, would have produced a huge file.  Also, if I was working with large PDFs or many PDFs (or both), I might have to look at huge numbers of pages.  Boredom or haste could lead me to flip past an important one-page document, while checking hundreds or thousands of less important pages.

Based on various factors (including the number of PDFs, their importance, and the time available), I decided to examine just the first page of each PDF.  I might not be able to tell if the whole document printed properly, but at least I could eliminate those instances where printing failed completely.

For this purpose, the process described in the previous section offered one possibility.  I could probably work up a set of commands to open a PDF, take a screenshot, and then close it, and then flip through the resulting screenshots.

I did not actually pursue that approach in this case, however.  Instead, I wanted to see if I could convert the PDF documents to JPG and then flip through just the first page from each such document.  If I had a hundred documents to check, I would have a hundred pages to look at -- not a thousand. A separate post discusses that investigation.  The tool I chose was Boxoft PDF to JPG Converter .  Another way to proceed might have been to split the PDFs first, using something like PDFsam , and then combine the PDFs of each resulting first page into a larger PDF that I could flip through.