Document scanning

To get to the point: it’s time to finalize my document workflow and write it down. This is basically how I treat incoming paper mail, how I store it and how I process it further so that paper doesn’t pile up on my desk.

This process has grown over a couple of years. It started when paper was really filling up the shelves but rarely got used. That is actually quite a common situation, but I didn’t know how to handle it better. So one week I decided it was enough, borrowed a scanner and started scanning every document I had. That took me about a week. Sorting and storing them afterwards took a month.

Now I’m left with no paper on my desk, no binders on the shelf, and all documents conveniently stored and archived as PDFs.

For me this is the best working solution I’ve found so far.


Problem

Paper. Short and simple. The postal system, invented a long time ago (B.C.), is still sending paper around. This in itself isn’t a problem (besides the spam I get, like newspapers and circulars), but storing the information on it is a growing one. Literally. Invoices, contracts, all that sort of shit that you feel guilty throwing away, because you suspect somebody will take advantage of you when you don’t have the letters to prove something.

This starts to fill up your cupboards and storage room. And you rarely use it until you really need to.


Solution

I see two solutions here:

  1. Don’t care

    Seriously. You can probably get pretty far without even caring. There must be a lot of people - probably most - somewhere in the limbo between having the perfect document management system and none at all (called the “Big Box”). I’d imagine you get through months or years of a quiet and happy life until, once in a while, somebody makes a mess of it. And most documents you can most likely ask for again when you need them.

  2. Get a grip on it.

    The obvious solution is to take care of all the stuff being sent to you and store it somehow. Binders are the easiest way, but they come with flaws (mentioned further below). Going one step further and storing it digitally solves a couple of the binders’ problems, but introduces some new ones. I’ve decided I’d rather live with the digital problems than the physical ones.

    So this is what I came up with.


Workflow

Creating

I’ve bought myself a nice, small document scanner from Canon: the Canon DR-2010C. I scan way more documents than pictures, so a document scanner makes more sense to me than a flatbed scanner. It also lets me scan multiple pages, both sides, in one run. Additional features like emailing the document were secondary for me. I’m next to the scanner anyway when scanning.

Creating the electronic documents from the paper ones involves only a couple of steps:

  1. Scan the document.

  2. Add the minimum meta-data.

  3. Save the document.

  4. Sort the document.

Scanning

All documents are scanned at 300 dpi in black and white. If I ever need a document again, I don’t need a 100% reproduction; a copy will be sufficient. A resolution of 300 dpi makes sure I can print the document and still read everything. The black-and-white setting reduces the file size and keeps the amount of disk storage I need low. Occasionally, when the black-and-white result isn’t good enough (e.g. ballpoint pen doesn’t scan well that way), I switch to greyscale mode. I can’t remember ever needing a color scan.

I usually use gscan2pdf to scan the documents. Applying the meta-data in the next step is also quite easy with it.

Meta-Data

There’s a minimum set of meta-data I’ve defined for myself that goes with each and every document.

  1. Document-date

    The date the document was created. Not by me, but by the author. Usually this is the date printed in the document header. If there’s no date at all on the document, it’s the date I either got the document or scanned it.

  2. Author of the document

    Who sent the document to me, or where I got it from. If it’s an invoice, the shop’s name will be the author.

  3. Document-type

    I distinguish between a couple of document types: Invoices, Manuals, Order-confirmations, Information, Receipts and Contracts. The _Information_ type is the most general one. If I can’t put a document anywhere else, it’s of the type _information_.

  4. Subject of the document

    This identifies the document within the document type. If the document is an invoice, the subject is the invoice number. If it’s a contract, the subject is the contract number. Anything that identifies the document and tells me what it’s about will fit.

  5. Keywords (optional)

    The keywords are mainly there to satisfy all my other needs. When there’s a piece of information that I think should be noted somewhere but that doesn’t fit the scheme, it becomes a keyword.

The meta-data is always applied to the document, no matter what.

Saving

I store my documents in PDF format, because the file format supports all the meta-data fields I’ve mentioned above.

When I receive a PDF document via e-mail, I update the meta-data, because only a handful of senders really care about the meta-data in the file.

All files are stored in a specific folder structure and have a specific filename:

/-
 | homefolder
  - author_1
  - author_2
  - author_3
  - author_4
  - author_5
  - ...

This structure is quite simple, but it works for me. Somewhere on my computer there’s the homefolder, which contains all the documents I’ve received. In it, there’s a sub-directory for each and every document author, named after the author. That way I’m grouping all documents by the author who sent them to me. I stick to the following conventions for the sub-directory names:

  1. Only lower-case letters in the folder names.

  2. All spaces in the names are replaced with underscores (“_”).

  3. In the case of communities it made sense for me to introduce a parent folder _communities_. But I’m not sure I’m gonna keep it that way.

  4. There is at least one folder called zz_archive, which holds all folders that haven’t received any document within the last five years.
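
The first two conventions can be applied mechanically. The following is a minimal sketch, not part of my actual setup; the function name author_dir is made up for illustration.

```bash
#!/bin/bash
# Hypothetical helper: turn an author name into a folder name following
# the conventions above (lower-case only, spaces replaced by underscores).
author_dir() {
  local name="$1"
  # Lower-case via tr so this also works on bash versions before 4.0
  name=$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]')
  # Replace every space with an underscore
  printf '%s\n' "${name// /_}"
}

author_dir "Amazon Co UK"   # prints: amazon_co_uk
```

Something like mkdir -p "/path/to/homefolder/$(author_dir "Some Author")" would then create the sub-directory; the path is a placeholder.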

A while ago I tried to create an even deeper folder structure within the sub-directories, e.g. for invoices, contracts, etc., but the number of documents in each folder never got out of hand, so I gave up on that. All of an author’s documents are in one folder.

When it comes to filenames, I put some of the meta-data into the filename. Each file has the following structure:

<YYYYmmdd>-<author>-<documenttype{3}><subject>[-<keywords>].<ext>
  • YYYYmmdd

    This is the document date from the meta-data in the format _YearMonthDay_ - no exceptions.

  • author

    Is the author of the document as described above. If the author’s name has more than one word, the words are connected with underscores (“_”) and not(!) spaces.

  • documenttype{3}

    Is a three-letter abbreviation of the document-type from the meta-data. I usually use the first three letters of the category I’ve chosen for myself. E.g. for invoices the abbreviation would be inv, for information inf, for contracts con, etc.

  • subject

    Is the subject of the document as used in the meta-data. It’s not separated from the document-type abbreviation with a hyphen. So invoices have the identifier inv followed by the invoice number, e.g. inv123456.

  • keywords

    Are whatever I feel might help me to describe the document better so that I know right away what it’s about. This can be a customer number (e.g. cus1234566) or the product I’ve bought with the invoice (e.g. -canon_dr2010).

  • ext

    Is the file extension; usually PDF. I sometimes use this scheme with other files as well, like images.

A complete filename would look like this:

19700101-amazon_co_uk-inv9999999-usb_stick.pdf
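
Putting such a name together from the meta-data fields can be sketched in a few lines of bash. This is only an illustration under my own assumptions: the function name build_filename and its argument order are made up, and the abbreviation is simply the first three letters of the document type.

```bash
#!/bin/bash
# Hypothetical helper: assemble a filename following the scheme
# <YYYYmmdd>-<author>-<documenttype{3}><subject>[-<keywords>].<ext>
# Arguments: date author type subject [keywords] [ext]
build_filename() {
  local date="$1" author="$2" doctype="$3" subject="$4"
  local keywords="${5:-}" ext="${6:-pdf}"

  # The date must be exactly eight digits (YearMonthDay)
  if [[ ! "$date" =~ ^[0-9]{8}$ ]]; then
    echo "invalid date: $date" >&2
    return 1
  fi

  # Lower-case the author, spaces become underscores
  author=$(printf '%s' "$author" | tr '[:upper:]' '[:lower:]')
  author="${author// /_}"

  # Three-letter abbreviation: first three letters of the lower-cased type
  doctype=$(printf '%s' "$doctype" | tr '[:upper:]' '[:lower:]')
  doctype="${doctype:0:3}"

  local name="${date}-${author}-${doctype}${subject}"
  # Keywords are optional and get their own hyphen-separated part
  if [[ -n "$keywords" ]]; then
    name="${name}-${keywords// /_}"
  fi
  printf '%s.%s\n' "$name" "$ext"
}

build_filename 19700101 "Amazon Co UK" invoice 9999999 "usb stick"
# prints: 19700101-amazon_co_uk-inv9999999-usb_stick.pdf
```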

I’m still working on a script that compares the PDF meta-data with the filename, renames the files, complains about mismatches and moves the files to the right location. I’ve gotten pretty far, but I haven’t finished yet. In terms of the workflow this would save me the whole step of taking care of the naming and storing of the files, since this would then happen automatically.
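
The "complains about mismatches" part could start out as a plain filename check. The sketch below is not that unfinished script; it only tests a name against the scheme with a regular expression. The function name check_filename and the allowed character sets (lower-case letters, digits and underscores for the author) are assumptions.

```bash
#!/bin/bash
# Sketch: check whether a filename follows the naming scheme and, if so,
# print its parts. The regular expression is an assumption derived from
# the scheme described above.
check_filename() {
  local file
  file=$(basename "$1")
  # date - author - 3-letter type + subject [- keywords] . ext
  local re='^([0-9]{8})-([a-z0-9_]+)-([a-z]{3})([^-.]+)(-(.+))?\.([A-Za-z0-9]+)$'
  if [[ "$file" =~ $re ]]; then
    printf 'date=%s author=%s type=%s subject=%s keywords=%s ext=%s\n' \
      "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}" \
      "${BASH_REMATCH[4]}" "${BASH_REMATCH[6]}" "${BASH_REMATCH[7]}"
  else
    echo "mismatch: $file" >&2
    return 1
  fi
}

check_filename "19700101-amazon_co_uk-inv9999999-usb_stick.pdf"
```

The check only looks at the shape of the name. It does not verify that the eight digits form a real calendar date, and it does not read the PDF meta-data.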

Sorting

The sorting of all documents basically happens automatically. Most shells sort directory listings by filename. Since every filename begins with the document date, the files are all sorted chronologically. The naming is quite consistent as well, so it’s easy to spot a certain document.
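
Because the date comes first and always has eight digits, selecting a date range is also easy. A minimal sketch, assuming a hypothetical helper called docs_between that takes a folder and two dates:

```bash
#!/bin/bash
# Hypothetical helper: list the files in a folder whose leading YYYYmmdd
# date lies between two dates (inclusive). The shell expands the glob in
# sorted order, so the oldest document comes first.
docs_between() {
  local dir="$1" from="$2" to="$3" path name date
  for path in "$dir"/*; do
    [[ -f "$path" ]] || continue
    name=${path##*/}
    # Everything before the first hyphen is the date part
    date=${name%%-*}
    # Skip files that do not start with an eight-digit date
    [[ "$date" =~ ^[0-9]{8}$ ]] || continue
    # 10# forces base 10 so leading zeros are not read as octal
    if (( 10#$date >= 10#$from && 10#$date <= 10#$to )); then
      printf '%s\n' "$name"
    fi
  done
}
```

For example, docs_between /path/to/homefolder/canon 20230101 20231231 would list everything from that author dated 2023; the path is a placeholder.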

Finding

Now the interesting part: finding documents. Yes, I could go into the directory of the document author and search for the document there. Even though this would work, it doesn’t feel quick enough. A simple script on the command line gives me all the documents I want:

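$ cat ~/bin/finddoc
#!/bin/bash
# Search every document location for files whose name contains $1.
LOCATIONS=('/path/to/location/1' '/path/to/location/2')
if [ -z "$1" ]; then
  echo "usage: finddoc <search term>" >&2
  exit 1
fi
echo "searching for $1..."
for location in "${LOCATIONS[@]}"; do
  # Case-insensitive match on the filename, regular files only
  find "${location}" -type f -iname "*$1*"
done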

This bash script, which I’ve called finddoc, takes whatever I’m looking for in the filename as its parameter. That’s why the author and the subject as well as some keywords are part of the filename.

Deleting

Deleting? What? I guess I should have called this destruction instead. The old paper documents can’t just be disposed of in the trash bin. I’ve got a paper shredder under my desk, which all documents coming out of the scanner meet quite soon after. It doesn’t just cut them into strips but into particles. This is the most satisfying part of the whole process for me. Still, after several years. Nothing is better than dropping something into a shredder.

I empty it once in a while - not very often, actually - and drop the stuff into the containers. If somebody considers that important enough to puzzle it back together - you deserve it.

Storing

What the shelf is for the binders, the hard disk is for the electronic documents. Here comes the part where it gets a bit expensive. The initial cost for a shelf and binders is not low but reasonable, and it’s not likely that the paper will just fail one day so that you can’t read the document (unless you store it in a moist basement or so). Unfortunately, this does happen with hard disks. So you need two of them and keep them mirrored in case one breaks.

A cheaper solution might be a quick backup every time on a USB disk or stick, depending on the number of documents you have. Then you might even get away cheaper than with a shelf.

Keep in mind that disks will fail and you need more than one copy of your documents.
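
A quick copy to a second disk, followed by a check that both copies match, could look like the sketch below. It is only one possible approach using plain cp and diff; the function name backup_docs is made up and all paths are placeholders.

```bash
#!/bin/bash
# Hypothetical helper: copy the document folder to a second disk and
# verify that both trees are identical afterwards.
backup_docs() {
  local src="$1" dst="$2"
  mkdir -p "$dst" || return 1
  # -R copies recursively, -p preserves timestamps and permissions
  cp -Rp "$src"/. "$dst"/ || return 1
  # diff -r exits non-zero if any file differs or exists on one side only
  if diff -r "$src" "$dst" > /dev/null; then
    echo "backup verified: $dst"
  else
    echo "backup differs from source: $dst" >&2
    return 1
  fi
}
```

Called as backup_docs /path/to/homefolder /path/to/usb_disk/homefolder. Note that this never deletes anything on the backup side, so a file removed from the source shows up as a difference on the next run.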

To give you an idea about the disk-usage:

I’ve got about 2500 PDF documents and they currently use up to 1.8GB. That’s an average of ~770KB per file. Hm.. is it really that much? Sounds way too big…
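
Numbers like these can be collected with a small script. A minimal sketch, assuming a hypothetical helper called pdf_stats that is pointed at the document folder:

```bash
#!/bin/bash
# Hypothetical helper: count the PDF files below a folder and report
# their total and average size in bytes.
pdf_stats() {
  local dir="$1" count=0 total=0 size file
  # Read NUL-separated names so spaces in filenames are handled
  while IFS= read -r -d '' file; do
    size=$(wc -c < "$file")
    total=$(( total + size ))
    count=$(( count + 1 ))
  done < <(find "$dir" -type f -iname '*.pdf' -print0)

  if (( count == 0 )); then
    echo "no PDF files found in $dir"
    return 0
  fi
  # Integer division, so the average is rounded down
  echo "files=$count total_bytes=$total average_bytes=$(( total / count ))"
}
```

Running pdf_stats /path/to/homefolder (placeholder path) prints the file count, the total size and the average size per file.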


Problems

The binder solution

  • Physical space.

    Simple as that. The more binders you have, the more space you need to store them. The amount of storage space needed is proportional to the number of binders you have/want.

  • Inflexibility.

    You make a binder where you put all the documents that you (for some reason) want to put under the letter “A”. The size of the binder defines how much paper you can actually store without opening another binder with the same topic. You either need to find a bigger binder or live with the fact that you have two locations where your document can be.

    Also, re-structuring your documents is kind of a hassle. If you imagine 500 loose papers flying around on your desk just to put them somewhere else - enjoy.

  • Exponential costs.

    This goes hand in hand with defining the organizational structure.

    Basically every change in your structure that goes together with growth will cost extra money. New binder = cost.

  • Weight.

    Move to a new apartment with two or three boxes filled only with binders - you can thank yourself for that.

  • Searchability.

    Paper is not good to search through. Unless you’re very, very disciplined and you know where to look (and given that the amount of data isn’t too big), I’d say you’re almost always slower with paper than electronically.

  • Reproducibility.

    Sending a copy is a bit hard without a copy machine.

The digital solution

  • Reproducibility.

    It’s much harder to get somebody else to join in on this solution than on the analog binder solution. It not only requires more knowledge about meta-data, file formats, resolution and color settings, it’s also about the next point:

  • Discipline.

    This really is about discipline. If you stop scanning documents, they’ll pile up like shit on your desk, and the line you have to cross to get started again gets harder to cross every day.

  • Meta-Data.

    You need to maintain more meta-data than with the binder solution (where the only meta-data was e.g. the binder labelled “A”, while the document itself pretty much didn’t have any additional information).

  • Initial costs.

    To get started you need a bit more equipment than only binders and a shelf - a scanner and a workstation, optionally a shredder. That will set you back quite a bit compared to the hardware costs of the other solution. But it will save you space and time afterwards.

  • Skill-level.

    One might argue that running the paper organization requires a certain level of knowledge as well - agreed. But I’d say you need more to set up the electronic one. Try it yourself if you disagree.