Miningleaks: How Julian Assange spent a summer coding in PostgreSQL

Years before Julian Assange became famous because of Wikileaks, he was wasting his time participating in open source. He is a contributor to PostgreSQL, among other projects. We show here what he did in Postgres, and when and why he did it.

The contributions of Julian Assange

After importing the data contained in the pg-who.txt file and grouping by developer, we found that, surprisingly, the name "Julian Assange" appeared in the list of committers. The next table shows a selection of committers from the PostgreSQL Git repository (we have removed some of them for brevity). Notice one of the last lines. Yes, it's him, Julian Assange.

Name                  Commits
Bruce Momjian           12239
Tom Lane                 9131
Peter Eisentraut         2166
Marc G. Fournier         1491
Thomas G. Lockhart       1078
Julian Assange              6
Vince Vielhaber             2
Kris Jurka                  1

The previous table was obtained by importing the CSV file into a PostgreSQL database (we would not dare to use MySQL to study PostgreSQL, of course), and then grouping and sorting by number of commits.

To create the table, we mirrored the header of the pg-who.txt file with

create table commiters (who text, day text, month text, dayno int, hour int, minute int, sec int, year int, offs text);

and then imported the file with

copy commiters from 'pg-who.txt' with delimiter as ';' csv header;

To obtain the data, we used the following query:

select who, count(*) as commits from commiters group by who order by commits desc;
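The same import-and-group step can be tried without a PostgreSQL server. The following is a minimal sketch that swaps PostgreSQL for an in-memory SQLite database from Python's standard library; the function name commits_per_committer and the three sample rows are made up for illustration and are not taken from pg-who.txt.

```python
import csv
import io
import sqlite3

# Hypothetical sample in the semicolon-separated layout implied by the
# "create table" statement; these rows are invented, not real pg-who.txt data.
SAMPLE = """who;day;month;dayno;hour;minute;sec;year;offs
alice;Thu;Jul;25;6;21;11;1996;+0000
bob;Thu;Jul;25;6;46;35;1996;+0000
alice;Sun;Jul;28;6;48;42;1996;+0000
"""


def commits_per_committer(text):
    """Load semicolon-separated rows and count commits per committer."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table commiters (who text, day text, month text, dayno int, "
        "hour int, minute int, sec int, year int, offs text)"
    )
    reader = csv.reader(io.StringIO(text), delimiter=";")
    next(reader)  # skip the header line, like "csv header" in COPY
    conn.executemany("insert into commiters values (?,?,?,?,?,?,?,?,?)", reader)
    rows = conn.execute(
        "select who, count(*) as commits from commiters "
        "group by who order by commits desc"
    ).fetchall()
    conn.close()
    return rows


print(commits_per_committer(SAMPLE))
```

With the real file, the text passed to the function would simply be the contents of pg-who.txt; the grouping query is the one shown above, unchanged.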

When, where, what and why?

When did he participate?

When did he start working on PostgreSQL? The pg-who.txt file shows that he made his first commit on July 25 1996 and his last on August 21 1996. The individual commits are the following:

Commit id                                 Date
99dc4e3b4335a1b927fc87878089cde4e07a808c  Wed Aug 21 00:22:41 1996
dfca092633dae8a6f2e5a3db859fe3faec6f2de6  Tue Aug 6 20:23:14 1996
7ef04b25cc59f87abf80ec25c3c1b229bf77a37e  Tue Aug 6 00:40:12 1996
ed3240d093e0fda4a8caa33b85a2636b6400c80a  Sun Jul 28 06:48:42 1996
76bc8cb97fc35673c42fe84fe6a9d6887260419a  Thu Jul 25 06:46:35 1996
23c7ff0b3c38657081590430d08ef48bb5bde759  Thu Jul 25 06:21:11 1996

To obtain this information, we simply ran grep on the file meta.txt, using "Julian Assange" as the pattern. The command is

$ grep "Julian Assange" meta.txt 

Then we edited the output of the command in Emacs to add the markup for the table above.
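The filter-then-reformat step could also be scripted. The sketch below is only an illustration: the text does not show the layout of meta.txt, so the sample lines are a hypothetical layout. The two "Julian Assange" lines reuse commit ids and dates from the table above, the middle line is invented, and the function name commits_by is ours.

```python
import re

# Hypothetical meta.txt layout (assumption): one commit per line with the
# commit id, the author name and the date. Only the ids and dates on the
# "Julian Assange" lines come from the table above.
SAMPLE_META = """\
commit 99dc4e3b4335a1b927fc87878089cde4e07a808c Julian Assange Wed Aug 21 00:22:41 1996
commit 0000000000000000000000000000000000000000 Some Body Mon Jan 1 00:00:00 1996
commit 23c7ff0b3c38657081590430d08ef48bb5bde759 Julian Assange Thu Jul 25 06:21:11 1996
"""

# A Git commit id is 40 hexadecimal characters.
COMMIT_ID = re.compile(r"\b[0-9a-f]{40}\b")
# Dates as printed in the table, e.g. "Tue Aug 6 20:23:14 1996".
DATE = re.compile(r"[A-Z][a-z]{2} [A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4}")


def commits_by(author, text):
    """Return (commit id, date) pairs for the lines that mention the author."""
    rows = []
    for line in text.splitlines():
        if author not in line:  # same filter as grep "Julian Assange"
            continue
        commit = COMMIT_ID.search(line)
        date = DATE.search(line)
        if commit and date:
            rows.append((commit.group(0), date.group(0)))
    return rows


for commit_id, date in commits_by("Julian Assange", SAMPLE_META):
    print(commit_id, date)
```

The regular expressions would need adjusting to whatever layout the real file uses; the point is only that the manual Emacs step reduces to extracting two fields per matching line.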

What did he do in PostgreSQL?

If we grep for each of the previous commit ids, we can find out which files he touched, and why he made the changes (from the log text of the commits).

His first commit

There is always a first time for everything. Julian's first contribution to Postgres was changing a comment:

@@ -7,7 +7,7 @@
  *
  *
  * IDENTIFICATION
- *    $Header: /cvsroot/pgsql/src/interfaces/libpq/fe-exec.c,v 1.5 1996/07/23 03:35:13 scrappy Exp $
+ *    $Header: /cvsroot/pgsql/src/interfaces/libpq/fe-exec.c,v 1.6 1996/07/25 06:21:11 julian Exp $
  *
  *-------------------------------------------------------------------------
  */
@@ -761,7 +761,7 @@ PQdisplayTuples(PGresult *res,
  * PQprintTuples()
  *
  * This is the routine that prints out the tuples that
- *  are returned from the backend.
+ * are returned from the backend.
  * Right now all columns are of fixed length,
  * this should be changed to allow wrap around for
  * tuples values that are wider.

The first changed line was not actually written by Julian himself; it was most likely updated automatically by CVS, the version control system that PostgreSQL used at the time.

The second changed line is only a comment.

More commits

In the next commit (76bc8cb97fc35673c42fe84fe6a9d6887260419a) he added more than 500 lines of code (!). In the log text, he admits that it is a "large re-write/enhancement".

Julian had submitted a patch (which he calls "proff") that was only partially applied to PostgreSQL. In this commit, he added the rest of the patch and also improved the code by removing redundancies. Finally, he warns that the code needs more testing.

These first two commits were made on the same day, about 25 minutes apart (06:21:11 and 06:46:35). Clearly the code had been written beforehand (or, alternatively, Julian is a hacker who is able to write 500 lines in 25 minutes, beating any previously known programming productivity record).
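The gap between the first two commits can be checked directly from the timestamps in the commit table. A minimal sketch, assuming the date layout printed in that table (the names DATE_FORMAT and gap_between are ours):

```python
from datetime import datetime

# Date layout as printed in the commit table, e.g. "Thu Jul 25 06:21:11 1996".
DATE_FORMAT = "%a %b %d %H:%M:%S %Y"


def gap_between(earlier, later):
    """Return the time elapsed between two commit timestamps."""
    start = datetime.strptime(earlier, DATE_FORMAT)
    end = datetime.strptime(later, DATE_FORMAT)
    return end - start


gap = gap_between("Thu Jul 25 06:21:11 1996", "Thu Jul 25 06:46:35 1996")
print(gap)  # 0:25:24
```

The same function applies to any pair of rows in the table, for example the two commits made on August 6.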

The next commit, ed3240d093e0fda4a8caa33b85a2636b6400c80a, was made on Sunday, July 28 1996, and is apparently a trivial change. However, when examined in context, we can see that if the "NOREADLINE" constant is not defined, the "else" directive is left orphaned, and therefore the program will not compile. So he fixed a potential bug. He annotated the log as a "bugfix", but it is unclear whether the bug was reported, found by Julian, or even introduced by Julian himself in the previous (and large) commit.

On August 6, he merged someone else's patch and then slightly modified the code.

Finally, on August 21, he made a very small change.

These commits explain where and why Julian participated in the project.

The Assange brotherhood: who else has touched the same files?

This is very easy to retrieve. One option is to look at the log of the files touched by Julian, using git log. The files to which Julian contributed most were, apart from him, touched only by Marc G. Fournier. A word of warning is needed here: in the PostgreSQL project it is common for a committer to submit someone else's code, so it could be that Marc was only adding third-party code, and that Julian therefore had to coordinate with more people to change those files.

It is worth noting that Julian himself added third-party patches in some of his few commits. In principle, we could think that this diminishes Julian's participation in the project, but on the other hand, we think it means precisely that Julian was close to the core of PostgreSQL; otherwise he would not have been allowed to commit patches to the main branch of the repository.

Conclusions

This is a very brief study based on some preprocessed information extracted from the Git repository of PostgreSQL, the well-known database management system. We used standard tools like grep, and also SQL, by importing the CSV files into a PostgreSQL database.

But the most interesting methodological point is how we found the "diffs" and logs of all of Julian's commits. It would be easy to get that information directly from git log. But it is even easier to get it from the web using a search engine. Because Git commits are identified by a (supposedly) unique hash string, the number of results that a Google search returns is small and very accurate. Moreover, the search finds not only the commit in the Git repository, but also information about the commit in other places, including mailing list archives and other researchers' work. One of the authors of this report (IH) had always found it very annoying how long and meaningless commit ids are in Git. He has now changed his mind.

Some questions remain open, though. If Julian was close to the core of the project, why did he make only six commits? We have also tried to find traces of Julian's activity in mailing lists and other PostgreSQL repositories, without success (we could find activity in other lists, like FreeBSD and NetBSD, but none in the PostgreSQL lists). Why did he disappear from the project after just one month? Why is there no mailing list activity regarding his contributions to the project?

In fact, last year, when PostgreSQL migrated to Git, the administrators asked all the developers to confirm their email addresses. It was then that the current core developers realized that the Julian Assange who was famous because of Wikileaks was also a contributor to PostgreSQL. The developers even comment that they do not remember him contributing to the project, and wonder whether it is a good idea to make his contributions public (were they not public already?). They also mention some other usernames that might be attributed to Julian, which we have not considered, but which do not affect our main question either: why do the developers of PostgreSQL not remember him? Could this mean that a new generation of developers has taken over the project and forgotten its history? It is a serious matter when developers forget the history of a project; it seems that they do not look at the archived information unless it is explicitly necessary, as in a migration from one kind of repository to another.

This is just an example of how proper mining tools, which offer historical information in a useful and easy way without disturbing the developers' usual workflow, are yet to be applied in the PostgreSQL project, and probably also yet to be "discovered" in our research community.

Date: 2012-05-11 16:25:31 PDT

Author: Martin Beck, Israel Herraiz
