| 1 | +<?xml version="1.0" encoding="utf-8" ?> |
| 2 | +<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> |
| 3 | +<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> |
| 4 | +<head> |
| 5 | +<meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> |
| 6 | +<meta name="generator" content="Docutils 0.3.0: http://docutils.sourceforge.net/" /> |
| 7 | +<title>mailbox_date_trimmer</title> |
| 8 | +<link rel="stylesheet" href="default.css" type="text/css" /> |
| 9 | +</head> |
| 10 | +<body> |
| 11 | +<div class="document" id="mailbox-date-trimmer"> |
| 12 | +<h1 class="title">mailbox_date_trimmer</h1> |
| 13 | +<div class="section" id="scenario-description"> |
| 14 | +<h1><a name="scenario-description">Scenario description</a></h1> |
| 15 | +<p>You are a mailing list administrator, or you are somebody who keeps |
| 16 | +mailing list archives for a user group, or you just have a fetish |
| 17 | +for email archives. Now, don't you always wonder why after running |
| 18 | +<a class="reference" href="http://www.hypermail.org/">Hypermail</a> or <a class="reference" href="http://www.mhonarc.org/">MHonArc</a> on your archives you always have some emails |
| 19 | +which date back to 1980 or far away in the future like 2011, even |
| 20 | +though you started collecting emails in the year 2000 and it's still |
| 21 | +this very same year? While there are many answers to this question, |
| 22 | +there is a very easy way to fix many, if not all, of these messages, |
| 23 | +which results in a much more consistent email archive without broken |
| 24 | +discussion threads.</p> |
| 25 | +</div> |
| 26 | +<div class="section" id="how-does-it-work"> |
| 27 | +<h1><a name="how-does-it-work">How does it work?</a></h1> |
| 28 | +<p>Mailing lists with some activity register at least some messages |
| 29 | +every month, and luckily most of these emails have correct |
| 30 | +dates. This program iterates through the email archive of your |
| 31 | +choice (in Unix mailbox format) checking the date header of each |
| 32 | +email and comparing it to the date of the previous email. If the |
| 33 | +difference in time is greater than a month, the current email's |
| 34 | +date is considered invalid.</p> |
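<p>The check described above can be sketched in a few lines of modern Python. This is only an illustration and not the actual mailbox_date_trimmer code: the function name <tt class="literal"><span class="pre">is_date_plausible</span></tt> and the 30-day approximation of "a month" are assumptions.</p>

```python
from datetime import datetime, timedelta

# Assumption: "a month" is approximated as 30 days in this sketch.
ONE_MONTH = timedelta(days=30)


def is_date_plausible(current, previous, window=ONE_MONTH):
    """Return True when `current` lies within `window` of `previous`.

    The comparison is symmetric, so dates too far in the past and
    dates too far in the future are both rejected.
    """
    return abs(current - previous) <= window


if __name__ == "__main__":
    previous = datetime(2000, 5, 1)
    print(is_date_plausible(datetime(2000, 5, 20), previous))
    print(is_date_plausible(datetime(1980, 1, 1), previous))
```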
| 35 | +<p>When an email is sent to a mailing list, it is very likely that it |
| 36 | +<em>hops</em> through some computers before it reaches its audience. The |
| 37 | +good thing about this is that each <em>hop</em> adds a timestamp to the |
| 38 | +email's headers. People running email servers connected day |
| 39 | +and night to the internet usually set them up correctly (or face |
| 40 | +the consequences), so it is very unlikely that one of these added |
| 41 | +timestamps is incorrect. Also, email delivery tends to be pretty |
| 42 | +quick from one of these servers to another, with delays of no more |
| 43 | +than a few minutes, or even seconds, in most circumstances.</p> |
| 44 | +<p>So when the current email's date is considered invalid, |
| 45 | +mailbox_date_trimmer finds all the date timestamps in the headers, |
| 46 | +and reading them in reverse order (servers add their headers to |
| 47 | +the beginning of the email) picks the first one whose difference |
| 48 | +to the previous email is smaller than one month (in 99% of cases |
| 49 | +this is the first candidate).</p> |
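<p>A minimal sketch of that selection step, assuming the hop timestamps come from <tt class="literal"><span class="pre">Received</span></tt> headers, where the date follows the last semicolon. The helper names <tt class="literal"><span class="pre">hop_timestamps</span></tt> and <tt class="literal"><span class="pre">pick_replacement</span></tt> are hypothetical, the 30-day window is an approximation, and timezones are ignored just as the program is described as doing. The real program may extract the timestamps differently.</p>

```python
from datetime import datetime, timedelta
from email.utils import parsedate

# Assumption: "one month" is approximated as 30 days in this sketch.
ONE_MONTH = timedelta(days=30)


def hop_timestamps(received_values):
    """Yield naive datetimes from Received header values, oldest hop first.

    `received_values` is in the order the headers appear in the message,
    that is, newest hop first, so the list is walked in reverse.
    """
    for value in reversed(received_values):
        # The timestamp of a Received header follows the last semicolon.
        _, separator, stamp = value.rpartition(";")
        if not separator:
            continue
        parsed = parsedate(stamp.strip())  # the timezone is ignored on purpose
        if parsed is None:
            continue
        try:
            yield datetime(*parsed[:6])
        except ValueError:
            continue  # out-of-range fields, e.g. day 32


def pick_replacement(received_values, previous, window=ONE_MONTH):
    """Return the first hop timestamp within `window` of `previous`, or None."""
    for candidate in hop_timestamps(received_values):
        if abs(candidate - previous) <= window:
            return candidate
    return None
```

<p>Walking the list in reverse means the timestamp of the hop closest to the sender is tried first, and later hops only serve as fallbacks.</p>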
| 50 | +<p>In the broken cases where an email doesn't have ANY header at all, |
| 51 | +mailbox_date_trimmer adds to this email the time of the previous |
| 52 | +email plus one second. In the cases where the closest match doesn't |
| 53 | +fall in the expected one month timeframe, mailbox_date_trimmer gives |
| 54 | +up and doesn't add any header at all. The latter could happen with |
| 55 | +legitimate emails which you moved incorrectly to a folder, or you |
| 56 | +unsubscribed for holidays and resubscribed much later, etc.</p> |
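<p>The three possible outcomes (use a header timestamp, fall back to the previous date plus one second, or give up) could be expressed roughly like this. The function name <tt class="literal"><span class="pre">resolve_date</span></tt> is hypothetical, and treating "no header at all" as "no usable timestamp was found" is this sketch's reading of the text.</p>

```python
from datetime import datetime, timedelta

# Assumption: "one month" is approximated as 30 days in this sketch.
ONE_MONTH = timedelta(days=30)


def resolve_date(candidates, previous, window=ONE_MONTH):
    """Decide on a corrected date for a message whose date looks invalid.

    `candidates` holds the timestamps found in the headers, in the
    order they should be tried.
    """
    if not candidates:
        # Nothing to work with: previous email's time plus one second.
        return previous + timedelta(seconds=1)
    for candidate in candidates:
        if abs(candidate - previous) <= window:
            return candidate
    # No candidate falls inside the window: give up, change nothing.
    return None
```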
| 57 | +<p>If your mailbox contains messages which fall into this category, |
| 58 | +tough luck, you will have to weed them out manually. For most of |
| 59 | +the other people in the world, rejoice, your calvary has come to |
| 60 | +an end, you can finally enjoy email archives with consistent dates.</p> |
| 61 | +</div> |
| 62 | +<div class="section" id="software-requisites"> |
| 63 | +<h1><a name="software-requisites">Software requisites</a></h1> |
| 64 | +<p>This software requires Python (<a class="reference" href="http://www.python.org">http://www.python.org</a>). It is known |
| 65 | +to work with versions 1.5.2 and 2.2.3. You also need my mailbox_reader |
| 66 | +module, which you should be able to get from:</p> |
| 67 | +<blockquote> |
| 68 | +<ul class="simple"> |
| 69 | +<li><a class="reference" href="http://gradha.sdf-eu.org/program/mailbox_reader.en.html">http://gradha.sdf-eu.org/program/mailbox_reader.en.html</a></li> |
| 70 | +<li><a class="reference" href="http://www.vex.net/parnassus/">http://www.vex.net/parnassus/</a></li> |
| 71 | +<li><a class="reference" href="http://freshmeat.net/">http://freshmeat.net/</a></li> |
| 72 | +</ul> |
| 73 | +</blockquote> |
| 74 | +</div> |
| 75 | +<div class="section" id="usage"> |
| 76 | +<h1><a name="usage">Usage</a></h1> |
| 77 | +<p>mailbox_date_trimmer is a command-line tool with very few options. |
| 78 | +Running mailbox_date_trimmer with the <tt class="literal"><span class="pre">-h</span></tt> or <tt class="literal"><span class="pre">--help</span></tt> arguments |
| 79 | +should bring up a help screen showing you how to use the program |
| 80 | +and with what switches. This program can read mailboxes from the |
| 81 | +hard disk or through standard input. It can also write new mailboxes |
| 82 | +or dump everything through standard output. Since it can read standard input, |
| 83 | +if you run the program without arguments it will sit there idle |
| 84 | +waiting for your input, just like the grep command.</p> |
| 85 | +<p>You can run the program like a filter inside a more complex command |
| 86 | +chain. It consumes/produces data one email at a time, so you can |
| 87 | +feed it gigabytes of data and it should not run out of memory unless |
| 88 | +you have emails which don't fit in your available free memory, |
| 89 | +or you have another heavyweight process consuming all your memory.</p> |
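<p>To illustrate why processing one email at a time keeps memory use bounded, here is a sketch of a generator that splits a Unix mailbox stream into messages. It is not the mailbox_reader API; the name <tt class="literal"><span class="pre">iter_messages</span></tt> is hypothetical, and the sketch assumes a well-formed mailbox in which body lines starting with "From " have been escaped.</p>

```python
import io


def iter_messages(lines):
    """Yield each message of an mbox stream as a list of its lines.

    A new message starts at every line beginning with "From ". Only the
    message currently being collected is held in memory.
    """
    current = []
    for line in lines:
        if line.startswith("From ") and current:
            yield current
            current = []
        current.append(line)
    if current:
        yield current


if __name__ == "__main__":
    sample = io.StringIO(
        "From a@example.org Sat Jan  1 00:00:00 2000\n"
        "Subject: one\n\nfirst body\n\n"
        "From b@example.org Sat Jan  1 00:00:01 2000\n"
        "Subject: two\n\nsecond body\n"
    )
    for message in iter_messages(sample):
        print(len(message), "lines, starting with", message[0].strip())
```

<p>Passing <tt class="literal"><span class="pre">sys.stdin</span></tt> instead of the sample stream and writing each message back with <tt class="literal"><span class="pre">sys.stdout.writelines()</span></tt> gives the filter shape described above.</p>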
| 90 | +<p>Note that while I have run this over all my personal mailing list |
| 91 | +archives, and the program is written in such a way that it should |
| 92 | +never do stupid things, hey, I'm a stupid human, and the computer |
| 93 | +just followed my instructions. So better make a safe backup of |
| 94 | +your mail archives before you use this software on them. Anyway, |
| 95 | +it has worked correctly with about 500MB of mail archives, which |
| 96 | +is all I have been able to get from the internet and friends.</p> |
| 97 | +</div> |
| 98 | +<div class="section" id="checking-the-generated-output"> |
| 99 | +<h1><a name="checking-the-generated-output">Checking the generated output</a></h1> |
| 100 | +<p>In order to verify that mailbox_date_trimmer didn't break anything |
| 101 | +seriously, the first thing you should do is inspect the generated |
| 102 | +mail archive and count the number of emails; it should equal the |
| 103 | +number of emails in your original mailbox. If this is not the case, |
| 104 | +I'm sorry, I must have done something terrible. Drop me an email.</p> |
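<p>One way to do the count, sketched with the <tt class="literal"><span class="pre">mailbox</span></tt> module from the standard library of current Python versions (not the mailbox_reader module mentioned earlier). The function names and the file names in the usage part are placeholders.</p>

```python
import mailbox


def count_messages(path):
    """Return the number of messages in the Unix mailbox file at `path`."""
    box = mailbox.mbox(path, create=False)
    try:
        return len(box)
    finally:
        box.close()


def same_message_count(original_path, trimmed_path):
    """Return True when both mailboxes contain the same number of messages."""
    return count_messages(original_path) == count_messages(trimmed_path)


if __name__ == "__main__":
    # Placeholder file names: substitute your own mailboxes.
    print(same_message_count("original.mbox", "trimmed.mbox"))
```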
| 105 | +<p>The second thing you can do is go one email after another checking |
| 106 | +what dates were modified. When mailbox_date_trimmer modifies the |
| 107 | +date of an email, if the verbose switch has been used, the original |
| 108 | +date is stored in the header <tt class="literal"><span class="pre">X-DT</span></tt>. The reason for the change |
| 109 | +is stored in the header <tt class="literal"><span class="pre">X-pi</span></tt>. You can therefore extract all |
| 110 | +modified messages with the following command, if you have grepmail |
| 111 | +available on your machine:</p> |
| 112 | +<pre class="literal-block"> |
| 113 | +grepmail -h X-pi mailbox > changed |
| 114 | +</pre> |
| 115 | +<p>Now you can open this new mailbox and see more easily which messages |
| 116 | +were modified. Don't you like these extra headers? Well, currently |
| 117 | +you have to grep them out yourself.</p> |
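<p>If you would rather not grep them out by hand, a small helper along these lines could drop the two headers from a single message. It is a sketch and not part of mailbox_date_trimmer: the name <tt class="literal"><span class="pre">strip_trace_headers</span></tt> is hypothetical, and the message is assumed to be given as a list of lines.</p>

```python
def strip_trace_headers(lines, names=("X-DT", "X-pi")):
    """Return the message lines without the named headers.

    Only the header block (everything before the first blank line) is
    touched, and folded continuation lines of a removed header are
    removed with it. Header names are matched case-insensitively.
    """
    wanted = {name.lower() for name in names}
    cleaned = []
    in_headers = True
    skipping = False
    for line in lines:
        if in_headers:
            if line in ("\n", "\r\n"):
                in_headers = False  # blank line: the body starts here
            elif line[:1] in (" ", "\t"):
                if skipping:
                    continue  # continuation of a removed header
            else:
                skipping = line.split(":", 1)[0].strip().lower() in wanted
                if skipping:
                    continue
        cleaned.append(line)
    return cleaned
```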
| 118 | +<p>You will notice that most of the dates the program generates are |
| 119 | +not accurate. I didn't bother to parse timezones, since an error of some |
| 120 | +hours is irrelevant when the acceptance window is one month |
| 121 | +in both directions. Also, time operations are done using the local |
| 122 | +time of your machine, when they should be using UTC.</p> |
| 123 | +<p>However, small differences in time didn't cause me any problems |
| 124 | +at all. You are welcome to send me patches in unified diff format |
| 125 | +to improve this or any other aspect of the program. The current |
| 126 | +version satisfies all my needs, so it is quite unlikely that |
| 127 | +I will actively improve this software (no itch to scratch).</p> |
| 128 | +</div> |
| 129 | +<div class="section" id="contact-information"> |
| 130 | +<h1><a name="contact-information">Contact information</a></h1> |
| 131 | +<p>You should be able to get me through <a class="reference" href="mailto:gradha@users.sourceforge.net">gradha@users.sourceforge.net</a>. If |
| 132 | +this fails, try going to my web page (currently at |
| 133 | +<a class="reference" href="http://gradha.sdf-eu.org/">http://gradha.sdf-eu.org/</a>); my current email address is stamped at |
| 134 | +the bottom of most pages. If that URL fails, you could try Googling |
| 135 | +for "Grzegorz Adam Hankiewicz" (don't forget the quotes). Am I |
| 136 | +narcissistic or what? As if you ever wanted to know that much...</p> |
| 137 | +</div> |
| 138 | +<div class="section" id="license"> |
| 139 | +<h1><a name="license">License</a></h1> |
| 140 | +<p>This software is covered under the <a class="reference" href="http://www.gnu.org/licenses/licenses.html#GPL">GPL</a>. See the full license text |
| 141 | +in the provided LICENSE file.</p> |
| 142 | +</div> |
| 143 | +</div> |
| 144 | +</body> |
| 145 | +</html> |