
Information retrieval from $HOME

Like everyone else, when I first encountered tree directory systems, I thought they were a marvelous way to organize information. I've been around computers since 1983, and have staunchly struggled to keep files and directories neatly organized. My physical filing cabinet has always been a mess, but I clung to the hope that my hard disk would be perfect.

For many years, I could draw my full tree directory from memory. Things have changed; I'm doing more things than I can track. Today, my $HOME contains 2.4k directories, 43k files, and 1.3 GB (this is almost all plain ASCII files -- no MS Office, no multimedia -- so 1.3 GB is a lot). My present filesystem has been with me, without interruption, since 1993, and there are old things in there that I can scarcely remember. Now, I often wander around $HOME like a stranger, using file completion and "locate" to feel my way around. I recently needed some HTML files that I was sure I had once written, but I didn't know where they were. I found myself reduced to saying:

$ find ~ -name '*.html' -print | xargs egrep -il string

which is a new low in terms of having no idea where things might be.

This article is a plea for help. We're all used to devoting effort to problems of information retrieval on the net. I think it's worth worrying about inner space. What lies beneath, under $HOME? How can relevant information and files be pulled up when needed? How can we navigate our own HOMEs with less bewilderment and confusion? Can software help us do this better? I know nothing about the literature on information retrieval, but this scratches my itch.

Multiplicity of trees

We have accumulated three different tree systems for organizing different pieces of information:

- The filesystem
- Email folders
- Web browser bookmarks

This is a mess. There should be only one filesystem, one set of folders.

Email is a major culprit. Everyone I know uses a sparse set of email folders alongside an elaborate filesystem, so we inevitably cut corners in organizing email.

We really need to make up our minds about how we treat email. Is email a channel, containing material which is in transit from the outside world to the "real" filesystem? In this case, the really important pieces of mail will get stored in their proper directory somewhere, and all other pieces of email will die. I have tried to achieve this principle in my life, with limited success.

Or is email permanent (as it is for most people), in which case material on any subject is fragmented between the directory system and email folders? If so, can email folders automatically adopt the organization of the directory system? Can email files be placed alongside the rest of the filesystem?

Web browser bookmarks are a third tree-structured organization which should not exist. It is easy to imagine a metadata.html file in every directory, with the bookmarks stored there. The browser would then inherit the tree directory structure of $HOME, and when sitting inside any one directory, the pertinent metadata would be handy.
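
As a minimal sketch of how this could work with today's tools: the metadata.html name comes from the paragraph above, while the aggregation script and the output filename are my own assumptions. A small script could stitch the per-directory bookmark files into one page for the browser:

  #!/bin/sh
  # Sketch only: gather every per-directory metadata.html under $HOME
  # into a single page that a browser can open as its bookmarks.
  out=$HOME/bookmarks-index.html
  {
    echo "<html><body>"
    find "$HOME" -name metadata.html | sort | while read -r f; do
      echo "<h2>${f%/metadata.html}</h2>"   # one section per directory
      cat "$f"
    done
    echo "</body></html>"
  } > "$out"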

Dhananjay Bal Sathe pointed out to me another source of escalating complexity in filesystems. It only affects users of Microsoft software, so I had never encountered it: MS's notion of "compound files", which are objects that look like normal files to the OS but are actually full directory systems (I guess they're like tarfiles). Since the content is hidden inside the compound file, you cannot use ordinary OS tools to navigate inside this little filesystem, only the application that made the compound file. He feels that if compound files had been treated as ordinary directories in the filesystem, it would have been a "simple, beautiful, elegant" and largely acceptable solution, instead of the mess which compound files have created.

Non-text files

If you use file utilities to navigate and search inside the filesystem, you will encounter some email. I use the "maildir" format, which is nice in that each piece of email lies in a separate file. However, MIME formats are a problem. When useful text is kept in MIME form, it's harder for tools to search for and access it.

MIME is probably a good idea when it comes to moving documents from one computer to another, but it seems to me that once email reaches its destination, it is better to store files in their native format.
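
A hedged sketch of what this could look like for one maildir message: unpack its MIME parts into the directory where the material belongs, so that ordinary tools see native files. This assumes munpack from the mpack package (-C unpacks into the given directory, -t also writes out the plain-text parts); the paths below are hypothetical.

$ munpack -t -C ~/work/projectX ~/Maildir/cur/1234567890.M1.host:2,S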

In my dream world, each directory would hold all the material on a subject (files, email, and metadata), and grep would work correctly, without being blocked by MIME-encoded files.

Geetanjali Sampemane pointed out that this is related to the questions about content-based filesystems, and suggested I look at a paper by Burra Gopal and Udi Manber on the subject (ask Google for it).

PDF and PostScript documents

PostScript and PDF have worked wonders for document transmission over the Internet, but this has helped escalate the complexity of inner space:

- As with MIME, .ps and .pdf files are not amenable to regular-expression searches the way text files are.
- An interesting and subtle consequence of the proliferation of .ps and .pdf files in my filesystem is that a larger fraction of the files there are alien. In the olden days, every file in my filesystem was mine. It used my file naming conventions, etc., so when I wandered around my filesystem, I knew my way. Today, there are so many alien files hanging around that I am less confident I know what is going on.
- Every now and then, I notice a .pdf file "which is going to be invaluable someday", and snarf it. If I'm lucky, it has a sensible filename, and if I'm luckier still, I place it in the correct spot in my filesystem. In that case, there's some hope that it'll get used nicely in the future. Unfortunately, a lot of people use incomprehensible names for .pdf files, such as ms6401.pdf, seiler.pdf, D53CCFF4C9021C19988841169FB6FD6EC1D56F711.pdf, and sr133.pdf. I find that interactive programs like Web browsers, email programs, etc. are clumsy at navigating tree directories, so my habit is to save into /tmp, then move the file using the command line. Sometimes I'm in too much of a hurry, and this gets messed up. Now and then, I place an incoming file into $HOME/JUNKPDF, hoping that I'll get around to organizing it later.

While I'm on this subject, I should describe a file naming convention I've evolved which seems to work well. I like it if a file is named Authoryyyy_string.pdf; this encodes the author's last name, the year, and a few bytes of description of what the file is about. For example, I use the filename SrinivasanShah2001_fastervar.pdf for a paper written by Srinivasan and Shah in 2001 about doing VaR faster.

I also take care to use this Authoryyyy_string as the key in my .bib file, so it's easy to move between the bibliography file and the documents. I often use regular expression searches on my bibliography file, and once I know I want a document, I just say locate Authoryyyy to track it down.
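
For instance (the bibliography filename and the grep pattern here are invented; only the key convention and the locate step come from the text):

$ grep -A3 'fastervar' refs.bib     # regular-expression search in the bibliography
$ locate SrinivasanShah2001         # then track down the document itself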

Some suggestions

I'm not an expert on information retrieval, so these are just some ideas on what might be possible, from a user perspective.

Email and Web bookmarks. As mentioned above, we really need a solution to the problem of email folders versus Web bookmark folders versus the filesystem. I'd like to have a MUA and a Web browser which treat my normal filesystem as the classification scheme to use, and save information in the corresponding directories. Every time I make changes to the directory structure, the MUA and browser should automatically pick up the new structure.
Fulltext search. I think we should have fulltext search engines which are hooked into the filesystem. Every time a file under $HOME changes, the search engine should update its indexes. Like Google, this search engine should walk into .html, .pdf, and .ps files and index all the text found therein. This will give us the ability to search inside inner space.
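
This suggestion could be prototyped crudely with the inotifywait tool from inotify-tools; reindex-file below is a stand-in for whatever indexer is used, not a real command, and a recursive watch on a large $HOME may need the inotify watch limit raised:

$ inotifywait -m -r -e close_write -e moved_to --format '%w%f' "$HOME" |
      while read -r f; do
          reindex-file "$f"    # hypothetical: update the index for this one file
      done
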
URLs-as-symlinks. If we had a fulltext search engine which worked on $HOME, it'd be nice if we could have a concept of a symlink which links to a URL. This reduces overhead in the filesystem, and ensures that one is always accessing the most recent version of the file (in return, one suffers from the problem of stale links, but hopefully producers of information will be careful to leave redirects). By placing symlinks into my directory, I'd feed PDF or PS files into the universe that my personal search engine indexes. These files would be just as usable as normal downloaded files as far as Unix operations such as reading, printing, emailing, etc. are concerned. Web browsers should give me a choice between downloading the file and placing a symlink with a filename of my choice in a directory of my choice.
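
Since a symlink's target is just a string, the idea can even be mocked up today: a dangling symlink can carry the URL, and a small wrapper can fetch it on demand (the filenames and URL below are illustrative, reusing names from earlier in the article):

$ ln -s 'http://fqdn/path/SrinivasanShah2001_fastervar.pdf' fastervar.pdf
$ curl -o /tmp/fastervar.pdf "$(readlink fastervar.pdf)"

Proper support would of course have to live in the filesystem or in the tools, so that reading, printing, and emailing such a file work transparently.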

Dhananjay Bal Sathe reminded me that there is a good case for doing this on a more ambitious scale: comprehensively supporting URLs as files, so that one would be able to say

$ cp URL file

or

$ lynx http://fqdn/path/a.html

and it should work just fine. :-) This goes beyond just symlinks.

Digital libraries. I have seen software systems like Greenstone which do a good job of being digital library managers, and they may be part of the solution.
I have sometimes toyed with the idea of using a digital library manager for all alien files. I could have a lifestyle in which every time I got a .pdf or .ps file from the net, I would simply toss it at the digital library software. (It would be nice if Mozilla and wget supported such a lifestyle with fewer keystrokes.) The digital library manager of my dreams would extract all the text from these files and fulltext index them (something that most library managers do not do), and it would not force me to type too much information about the file (which most of them do).
The logical next step of this idea is a digital library manager which just scours my $HOME ferreting out all files and fulltext indexing them, and that seems like a better course. In this case, it's just my fulltext search engine which indexes everything in $HOME.
Bibliographical information for the library manager. One path for progress could be for people who publish .pdf and .ps documents on the Web to adopt a standard through which XML files containing bibliographical information are also made available. Every URL http://path/file.pdf should be accompanied by a http://path/file.bib.xml which contains this information.


I know one initiative -- RePEC -- in which people supplying .pdf or .ps files also supply bibliographical information about them, but I think it's not quite there yet; it requires too much overhead. The proposal above is simpler. Every time a client fetches http://path/file.pdf, it can test for the existence of http://path/file.bib.xml, and if that's found, the user is spared the pain of typing bibliographical information into his digital library manager.
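
A sketch of the client side, using the placeholder URL from the text; curl's -f flag simply makes a missing .bib.xml fail quietly:

$ url=http://path/file.pdf
$ curl -sfO "$url"
$ curl -sfO "${url%.pdf}.bib.xml" && echo "bibliographic record found alongside the PDF"
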
A user interface for supplying a path. When a file is being downloaded, the user is required to supply a filename and a path. I would really like it if authors of software (like Mozilla) gave us a commandline with file completion to do this. I find the GUI interaction that they force me to have extremely inefficient, and it costs so much time that when I'm in a hurry, I tend to misclassify an incoming file. File completion is the fastest way to locate a directory inside a filesystem, and I think I should at least have the choice of configuring Mozilla to use it instead of the silly GUI interface. When we re-engineer Unix to make it easy-to-learn, we should not give up easy-to-use.
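
As an illustration, even a one-line prompt of the kind a browser could shell out to would do the job; bash's read -e gives readline filename completion, and the paths here are illustrative:

$ read -e -p "Save incoming file to: " dest && mv /tmp/incoming.pdf "$dest"
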
Quality scoring in inner space. A search string will get hundreds of hits on a fulltext search engine, so how can software give us a better sense of which are the important documents and which aren't? In the problem of searching inside inner space, Google's technology (of counting hyperlinks to you) will not work. A few things that might help in inventing heuristics:

- The most recently read or written files should be treated as more important.
- Files that are accessed more often should be treated as more important. (This will require instrumenting the filesystem component inside the kernel.)
- Makefiles articulate relationships between files. An information retrieval tool that crawls around $HOME should use this information when it exists. Targets in makefiles are less important, as are files mentioned in make clean or make squeaky. As an example, such intelligence would really help an information retrieval tool that hit my $HOME: in every document directory, I have a Makefile, and the tool could use it to learn that a few .tex files matter, while the .pdf or .ps files do not (since they are all produced by the Makefile, and mentioned in make clean and make squeaky). A rough sketch of scraping this out of a Makefile appears after this list.
- "My files are more important than files by others" is a useful principle, but it's difficult to accurately know the authorship of a file. The URLs-as-symlinks idea (mentioned earlier) can help. If I have snarfed a .pdf file down into a directory, the search engine has no way of knowing that it's an alien file. If I have left a symlink to the .pdf file, the search engine knows this should be indexed, but at a lower priority.

Less is more -- how to store less. One way to reduce the complexity of the filesystem is to help people feel comfortable about not downloading from the net. When I see a page on the net that looks interesting, I tend to download it and keep a local copy, partly because I'm thinking that I might not be able to find it later.
Instead, I'd like to hit a button on the browser which talks to Google and says "I think this page could be useful to me." From this point on, when I do searches with Google, this page should earn a higher relevancy score. If a large number of people used Google in this fashion, it would be a new and powerful way for Google to obtain information about the quality of pages on the Web.
Superstrings. I think we need a tool called superstrings which thinks intelligently about the files it is facing. If the file it faces is a normal textfile, superstrings is just strings(1), but if it faces .pdf, .ps, MIME, etc. it should extract the useful text with greater intelligence than ordinary strings(1). This can be combined with grep, etc., to improve tools for information access in the filesystem.
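
A minimal sketch of such a tool, covering only the two formats discussed above; it assumes pdftotext (from poppler) and ps2ascii (from ghostscript) are installed, and anything unrecognised falls through to strings(1):

  #!/bin/sh
  # superstrings: emit the searchable text of each argument on stdout.
  for f in "$@"; do
      case "$f" in
          *.pdf) pdftotext "$f" - ;;   # extract text from PDF to stdout
          *.ps)  ps2ascii "$f" ;;      # extract text from PostScript
          *)     strings "$f" ;;       # fall back to plain strings(1)
      esac
  done

It could then be combined with grep in the usual way, e.g. superstrings ms6401.pdf | grep -i string.
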
Help me delete files. Deleting files is one important way of reducing complexity. I'd like to get data about what parts of my filesystem I am never reading/touching. I could launch into spring cleaning every now and then and blow away files and directories that are really obsolete, supported by evidence about what I tend to use and what I tend to ignore. Note that I'm only envisioning a decision support tool, not an automated tool which deletes infrequently-used files. (Once again, this will require instrumenting the filesystem component inside the kernel.)
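
As a starting point for such decision support, GNU find can already list files that have not been read in over a year; atime is only a rough proxy (and useless on filesystems mounted noatime), which is why proper instrumentation in the kernel would be better:

$ find "$HOME" -type f -atime +365 -printf '%A+ %p\n' | sort | head -50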

In summary, people working in information retrieval are focused on searching the Web, but I think we have a real problem lurking in our backyard. Many of us are finding it harder and harder to navigate inside our HOMEs and find the stuff we need. I think it's worth putting some effort into making things better. There is a lot that ye designers of software can do to help, ranging from putting file completion into Mozilla to new ideas in indexing tools.
