
Commit 6ef012a

author
Peter McCluskey
committed
Add mailbox_date_trimmer
1 parent 77a7e1d commit 6ef012a

6 files changed

+1264
-2
lines changed

Changelog

Lines changed: 4 additions & 0 deletions
@@ -1,5 +1,9 @@
11
Version Changes for Hypermail
22
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3+
Peter McCluskey (Sep 29, 2004)
4+
Add support for JAVT timezone.
5+
Add mailbox_date_trimmer to contrib, faq.
6+
37
Peter McCluskey (Jun 2, 2004)
48
Add language code substitution cookie patch from Shane Wegner.
59

Lines changed: 145 additions & 0 deletions
@@ -0,0 +1,145 @@
1+
<?xml version="1.0" encoding="utf-8" ?>
2+
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
3+
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
4+
<head>
5+
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
6+
<meta name="generator" content="Docutils 0.3.0: http://docutils.sourceforge.net/" />
7+
<title>mailbox_date_trimmer</title>
8+
<link rel="stylesheet" href="default.css" type="text/css" />
9+
</head>
10+
<body>
11+
<div class="document" id="mailbox-date-trimmer">
12+
<h1 class="title">mailbox_date_trimmer</h1>
13+
<div class="section" id="scenario-description">
14+
<h1><a name="scenario-description">Scenario description</a></h1>
15+
<p>You are a mailing list administrator, or you are somebody who keeps
16+
mailing list archives for a user group, or you just have a fetish
17+
for email archives. Now, don't you always wonder why after running
18+
<a class="reference" href="http://www.hypermail.org/">Hypermail</a> or <a class="reference" href="http://www.mhonarc.org/">MHonArc</a> on your archives you always have some emails
19+
that date back to 1980 or lie far in the future, like 2011, even
20+
though you started collecting emails in the year 2000 and it's still
21+
this very same year? While there are many answers to this question,
22+
there is a very easy way to fix many, if not all, of these messages,
23+
which results in a much more consistent email archive without broken
24+
discussion threads.</p>
25+
</div>
26+
<div class="section" id="how-does-it-work">
27+
<h1><a name="how-does-it-work">How does it work?</a></h1>
28+
<p>Mailing lists with some activity register at least some messages
29+
every month, and luckily most of these emails have correct
30+
dates. This program iterates through the email archive of your
31+
choice (in Unix mailbox format) checking the date header of each
32+
email and comparing it to the date of the previous email. If the
33+
difference in time is greater than a month, the current email's
34+
date is considered invalid.</p>
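<p>The check described above can be sketched in a few lines of Python.
This is only an illustration, not the tool's actual code: the name
<tt class="literal"><span class="pre">is_plausible</span></tt> is hypothetical, and "a month" is assumed to mean
30 days. Like the tool itself, the sketch ignores timezones.</p>

```python
from datetime import datetime, timedelta

# Assumption: "a month" is approximated as 30 days; the real tool may differ.
ONE_MONTH = timedelta(days=30)


def is_plausible(current: datetime, previous: datetime) -> bool:
    """Return True if `current` lies within one month of `previous`.

    The window applies in both directions, so dates too far in the
    past and dates too far in the future are both rejected.
    """
    return abs(current - previous) <= ONE_MONTH
```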
35+
<p>When an email is sent to a mailing list, it is very likely that it
36+
<em>hops</em> through some computers before it reaches its audience. The
37+
good thing about this is that each <em>hop</em> adds a timestamp to the
38+
email's headers. People running email servers connected day
39+
and night to the internet usually set them up correctly (or face
40+
the consequences), so it is very unlikely that one of these added
41+
timestamps is incorrect. Also, email delivery tends to be pretty
42+
quick from one of these servers to another, with delays no longer
43+
than minutes, or even seconds, in most circumstances.</p>
44+
<p>So when the current email's date is considered invalid,
45+
mailbox_date_trimmer finds all the date timestamps in the headers,
46+
and reading them in reverse order (servers add their headers to
47+
the beginning of the email) picks the first one whose difference
48+
to the previous email is smaller than one month (the first
49+
choice about 99% of the time).</p>
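<p>A rough sketch of this selection step, using Python's standard
<tt class="literal"><span class="pre">email</span></tt> package rather than mailbox_reader. It assumes the hop
timestamps live in <tt class="literal"><span class="pre">Received</span></tt> headers after the last semicolon,
and that a month is 30 days. The function names are hypothetical, and
unlike the tool this sketch does parse timezones.</p>

```python
from datetime import datetime, timedelta, timezone
from email.message import Message
from email.utils import parsedate_to_datetime
from typing import Iterator, Optional

ONE_MONTH = timedelta(days=30)  # assumption: a month is 30 days


def received_dates(msg: Message) -> Iterator[datetime]:
    """Yield the timestamps of the Received headers, oldest hop first."""
    # Servers prepend their header, so file order is newest first.
    for value in reversed(msg.get_all("Received", [])):
        _, sep, stamp = value.rpartition(";")
        if not sep:
            continue  # no timestamp part in this header
        try:
            parsed = parsedate_to_datetime(stamp.strip())
        except (TypeError, ValueError):
            continue  # unparseable timestamp, try the next hop
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=timezone.utc)
        yield parsed


def pick_replacement(msg: Message, previous: datetime) -> Optional[datetime]:
    """Return the first hop timestamp within a month of `previous`.

    `previous` must be timezone-aware. Returns None when no hop
    timestamp falls inside the window.
    """
    for candidate in received_dates(msg):
        if abs(candidate - previous) <= ONE_MONTH:
            return candidate
    return None
```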
50+
<p>In the broken cases where an email doesn't have ANY header at all,
51+
mailbox_date_trimmer adds to this email the time of the previous
52+
email plus one second. In the cases where the closest match doesn't
53+
fall in the expected one month timeframe, mailbox_date_trimmer gives
54+
up and doesn't add any header at all. The latter could happen with
55+
legitimate emails which you moved incorrectly to a folder, or you
56+
unsubscribed for holidays and resubscribed much later, etc.</p>
57+
<p>If your mailbox contains messages which fall into this category,
58+
tough luck, you will have to weed them out manually. For most of
59+
the other people in the world, rejoice, your ordeal has come to
60+
an end, you can finally enjoy email archives with consistent dates.</p>
61+
</div>
62+
<div class="section" id="software-requisites">
63+
<h1><a name="software-requisites">Software requisites</a></h1>
64+
<p>This software requires Python (<a class="reference" href="http://www.python.org">http://www.python.org</a>). It is known
65+
to work with versions 1.5.2 and 2.2.3. You also need my mailbox_reader
66+
module, which you should be able to get from:</p>
67+
<blockquote>
68+
<ul class="simple">
69+
<li><a class="reference" href="http://gradha.sdf-eu.org/program/mailbox_reader.en.html">http://gradha.sdf-eu.org/program/mailbox_reader.en.html</a></li>
70+
<li><a class="reference" href="http://www.vex.net/parnassus/">http://www.vex.net/parnassus/</a></li>
71+
<li><a class="reference" href="http://freshmeat.net/">http://freshmeat.net/</a></li>
72+
</ul>
73+
</blockquote>
74+
</div>
75+
<div class="section" id="usage">
76+
<h1><a name="usage">Usage</a></h1>
77+
<p>mailbox_date_trimmer is a command-line tool with very few options.
78+
Running mailbox_date_trimmer with the <tt class="literal"><span class="pre">-h</span></tt> or <tt class="literal"><span class="pre">--help</span></tt> arguments
79+
should bring up a help screen showing you how to use the program
80+
and with what switches. This program can read mailboxes from the
81+
hard disk or through standard input. It can also write new mailboxes
82+
or dump everything to standard output. Reading from standard input means that
83+
if you run the program without arguments it will sit there idle
84+
waiting for your input, just like the grep command.</p>
85+
<p>You can run the program like a filter inside a more complex command
86+
chain. It consumes/produces data one email at a time, so you can
87+
feed it gigabytes of data and it should not run out of memory unless
88+
you have emails which don't fit in your available free memory,
89+
or you have another heavyweight process consuming all your memory.</p>
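<p>The one-email-at-a-time behaviour can be pictured with a small
generator. This is a sketch under assumptions, not the program's real
code: <tt class="literal"><span class="pre">iter_messages</span></tt> and <tt class="literal"><span class="pre">run_filter</span></tt> are hypothetical names, and
the mailbox is assumed to escape body lines starting with "From " as
"&gt;From ", as is usual for Unix mailboxes.</p>

```python
import sys


def iter_messages(lines):
    """Yield each message of a Unix mailbox stream as a list of lines.

    Only one message is held in memory at a time, so memory use is
    bounded by the largest single email, not by the archive size.
    """
    current = []
    for line in lines:
        # A line starting with "From " opens the next message.
        if line.startswith("From ") and current:
            yield current
            current = []
        current.append(line)
    if current:
        yield current


def run_filter(source=sys.stdin, sink=sys.stdout):
    """Copy a mailbox from `source` to `sink`, one message at a time."""
    for message in iter_messages(source):
        # A real tool would inspect and fix the Date header here.
        sink.writelines(message)
```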
90+
<p>Note that while I have run this over all my personal mailing list
91+
archives, and the program is written in such a way that it should
92+
never do stupid things, hey, I'm a stupid human, and the computer
93+
just followed my instructions. So better make a safe backup of
94+
your mail archives before you use this software on them. Anyway,
95+
it has worked correctly with about 500MB of mail archives, which
96+
is all I have been able to get from the internet and friends.</p>
97+
</div>
98+
<div class="section" id="checking-the-generated-output">
99+
<h1><a name="checking-the-generated-output">Checking the generated output</a></h1>
100+
<p>In order to verify that mailbox_date_trimmer didn't break anything
101+
seriously, the first thing you should do is inspect the generated
102+
mail archive and count the number of emails; it should equal the
103+
number of emails in your original mailbox. If this is not the case,
104+
I'm sorry, I must have done something terrible. Drop me an email.</p>
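<p>One way to do this count is sketched below. The helper names are
hypothetical, and the sketch assumes every message starts with a
"From " separator line and that body lines starting with "From " are
escaped, as is usual for Unix mailboxes.</p>

```python
def count_messages(lines):
    """Count messages in an iterable of mailbox lines (bytes)."""
    return sum(1 for line in lines if line.startswith(b"From "))


def same_message_count(original_path, trimmed_path):
    """Return True if both mailbox files hold the same number of emails."""
    # Binary mode avoids decoding errors on old or oddly encoded mail.
    with open(original_path, "rb") as original, open(trimmed_path, "rb") as trimmed:
        return count_messages(original) == count_messages(trimmed)
```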
105+
<p>The second thing you can do is go one email after another checking
106+
what dates were modified. When mailbox_date_trimmer modifies the
107+
date of an email, if the verbose switch has been used, the original
108+
date is stored in the header <tt class="literal"><span class="pre">X-DT</span></tt>. The reason for the change
109+
is stored in the header <tt class="literal"><span class="pre">X-pi</span></tt>. You can therefore extract all
110+
modified messages with the following command, if you have grepmail
111+
available on your machine:</p>
112+
<pre class="literal-block">
113+
grepmail -h X-pi mailbox &gt; changed
114+
</pre>
115+
<p>Now you can open this new mailbox and see more easily which messages
116+
were modified. Don't you like these extra headers? Well, currently
117+
you have to grep them out yourself.</p>
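<p>If you would rather script the removal, a minimal sketch with Python's
standard <tt class="literal"><span class="pre">email</span></tt> package follows. The function name is hypothetical, and
it works on a single message passed without its "From " separator
line, which you would have to keep and write back yourself.</p>

```python
from email import message_from_string


def strip_trimmer_headers(raw_message):
    """Return `raw_message` without the X-DT and X-pi headers."""
    msg = message_from_string(raw_message)
    for name in ("X-DT", "X-pi"):
        # Deleting a header that is absent is not an error.
        del msg[name]
    return msg.as_string()
```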
118+
<p>You will notice that most of the dates the program generates are
119+
not accurate. I didn't bother to parse timezones, since an error of some
120+
hours is irrelevant when the acceptance time frame is one month
121+
in both directions. Also, time operations are done using the local
122+
time of your machine, although they should use UTC.</p>
123+
<p>However, small differences in time didn't cause me any problems
124+
at all. You are welcome to send me patches in unified diff format
125+
to improve this or any other aspect of the program. The current
126+
version satisfies all my needs, so it is quite unlikely that
127+
I will actively improve this software (no itch to scratch).</p>
128+
</div>
129+
<div class="section" id="contact-information">
130+
<h1><a name="contact-information">Contact information</a></h1>
131+
<p>You should be able to get me through <a class="reference" href="mailto:gradha&#64;users.sourceforge.net">gradha&#64;users.sourceforge.net</a>. If
132+
this fails, try going to my web page (currently at
133+
<a class="reference" href="http://gradha.sdf-eu.org/">http://gradha.sdf-eu.org/</a>), where my current email address is stamped at
134+
the bottom of most pages. If that URL fails, you could try Googling
135+
for &quot;Grzegorz Adam Hankiewicz&quot; (don't forget the quotes). Am I
136+
narcissistic or what? As if you ever wanted to know that much...</p>
137+
</div>
138+
<div class="section" id="license">
139+
<h1><a name="license">License</a></h1>
140+
<p>This software is covered under the <a class="reference" href="http://www.gnu.org/licenses/licenses.html#GPL">GPL</a>. See the full license text
141+
in the provided LICENSE file.</p>
142+
</div>
143+
</div>
144+
</body>
145+
</html>
