Skip to content

Wrong created_at values in commits.csv #97

@huichen-cs

Description

@huichen-cs

Looks to me that the most recent dump (2019-06-01) has wrong time stamp. Below is how we may reproduce the error.

In http://ghtorrent-downloads.ewi.tudelft.nl/mysql/mysql-2019-06-01.tar.gz, I found the following,

$ grep "2038-01-18" commits.csv
239581704,"1e954ef83c79ccccb8f570d350ea0ff934650fbb",4498938,4498938,18281400,"2038-01-18 23:00:00"
242299182,"0ebc3ea059dfcc8124135b94dfaa40ba7d49e5b2",1340592,1340592,18533076,"2038-01-18 23:00:00"
606676461,"d4cbfbc31940a514f2974405b579ff8989f9a0b6",33691743,33691743,55773517,"2038-01-18 00:00:00"

I loaded the CSV files to a PostgreSQL database, and then do a query. These projects are,

ghtorrent=# SELECT id,url FROM projects WHERE id=18281400 OR id=18533076 OR id=55773517;
    id    |                              url
----------+---------------------------------------------------------------
 18281400 | https://api.github.com/repos/cirosantilli/test-streak-puppet2
 18533076 | https://api.github.com/repos/phy1729/illacceptanything
 55773517 | https://api.github.com/repos/nhm150030/jfox
(3 rows)

Having compared the dates in the CSV file with the commit log in these projects. The dates are indeed wrong. Similarly, I also found these,

$ grep "2028-12-31 23:00:00" commits.csv
239657882,"182ef308602eabf227265eb2a5034ede567dccfa",4498938,4498938,18281400,"2028-12-31 23:00:00"
242561682,"cd952389edd54aa00cd3a98382da1193a1fc6d36",1340592,1340592,18507496,"2028-12-31 23:00:00"

In fact, if we group by the duration of commits of projects where the duration is the years between the oldest and the newest commits of a project and count the number of projects in a group, I got the following,

 duration | count
----------+-------
        0 |   492
        1 |   632
        2 |   414
        3 |   363
        4 |   325
        5 |   321
        6 |   277
        7 |   214
        8 |   202
        9 |   150
       10 |   102
       11 |   106
       12 |    89
       13 |    82
       14 |    93
       15 |    60
       16 |    63
       17 |    51
       18 |    49
       19 |    38
       20 |    33
       21 |    24
       22 |     8
       23 |    11
       24 |    14
       25 |    13
       26 |     8
       27 |     4
       28 |     1
       29 |     2
       32 |     1
       33 |     1
       34 |     3
       36 |     1
       38 |     2
       40 |     3
       42 |     1
       43 |     1
       44 |     9
       45 |     7
       46 |    15
       47 |    10
       48 |     2
       49 |     7
       58 |     1
       68 |     2
(46 rows)

where count is the number of projects whose duration is given on the left column.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions