These tar files contain hourly page requests for all public Wikimedia wikis.
All page requests are included, whether for editing or reading, and whether or not the page exists.

Since May 2015 a newer data stream, with bots and other non-human traffic filtered out
(aka webstatscollector 3.0), is aggregated at https://dumps.wikimedia.org/other/pageviews/
(but for older years this folder is still used to generate reports).

webstatscollector 1.0:
  Info:       https://wikitech.wikimedia.org/wiki/Analytics/Data/Pagecounts-raw
  Source:     UDP messages from squids
  Aggregator: http://svn.wikimedia.org/viewvc/mediawiki/trunk/webstatscollector/ (author: Domas Mituzas)
  Packaging:  DammitSyncProjectcounts.pl on WMF wikistats server (patching and packaging by Erik Zachte)

webstatscollector 2.0:
  Info:       https://wikitech.wikimedia.org/wiki/Analytics/Data/Pagecounts-all-sites
  Source:     Hadoop / Hive
  Packaging:  DammitSyncProjectcounts.pl on WMF wikistats server (patching and packaging by Erik Zachte)

webstatscollector 3.0:
  Source:     Hadoop / Hive
  See readme.txt at https://dumps.wikimedia.org/other/pageviews/

FILE CONTENT

Here are a few sample lines from one file:

  eml - 134 2336748
  en - 8798556 205294255505
  en.b - 12011 160444865
  en.d - 72411 651799702
  en.mw - 693098 61018547525
  en.n - 6976 106480601
  en.q - 15276 261954372
  en.s - 10575 85471409
  en.v - 4433 34335326
  eo - 14476 240604800

In the above, the first column (for example "en.b") is the project name.
The second column is always a dash, the third column is the total number of page requests,
and the fourth column is the size of the content returned.
When lines have been patched (see below), the fourth column always shows a 1.

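
The four-column layout described above can be read with a few lines of Python. This is only a minimal sketch, not part of the wikistats tooling: the names ProjectCount and parse_line are made up for illustration, and treating a size of exactly 1 as "patched" is a heuristic derived from the note above.

```python
from typing import NamedTuple


class ProjectCount(NamedTuple):
    """One line of a projectcounts file (hypothetical helper type)."""
    project: str   # first column, e.g. "en.b"
    requests: int  # third column: total number of page requests
    size: int      # fourth column: size of the content returned

    @property
    def patched(self) -> bool:
        # Assumption: patched lines are recognised by a fourth column of 1.
        return self.size == 1


def parse_line(line: str) -> ProjectCount:
    """Split a whitespace-separated projectcounts line into its columns."""
    fields = line.split()
    if len(fields) != 4 or fields[1] != "-":
        raise ValueError(f"unexpected projectcounts line: {line!r}")
    project, _, requests, size = fields
    return ProjectCount(project, int(requests), int(size))


# Three of the sample lines shown above.
SAMPLE = """\
en - 8798556 205294255505
en.b - 12011 160444865
eo - 14476 240604800
"""

records = [parse_line(line) for line in SAMPLE.splitlines() if line.strip()]
for record in records:
    print(record.project, record.requests, record.patched)
```

Rejecting lines that do not have exactly four columns with a dash in the second position keeps malformed input from being silently counted.
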
The following abbreviations are used for the project name (see the Info pages linked above for details):

  wikibooks:        '.b'
  wiktionary:       '.d'
  wikimedia:        '.m'   (for special wikis)
  wikimedia mobile: '.mw'  (for all projects combined)
  wikinews:         '.n'
  wikipedia:        (no suffix)
  wikiquote:        '.q'
  wikisource:       '.s'
  wikiversity:      '.v'
  wikivoyage:       '.voy'
  mediawiki:        '.w'

webstatscollector 2.0 introduced the suffixes '.m' for mobile and '.zero' for zero traffic.
It kept '.mw' for mobile traffic to all wikis combined, for backward compatibility,
but '.mw' is now redundant and deprecated.
For special wikis, mobile and zero traffic are found at suffixes '.m.m' and '.m.zero'.

REFRESH RATE

New projectcounts files are collected and added to the tar file at least once per day.
File 'most-recent-file.txt' contains the name of the most recent projectcounts file in the archives.

PATCHED FILES

There have been periods where the WMF servers could not handle all log traffic.
These situations were detected only after months of underreporting, which could be as high as 20%.
(Since late 2011 there is better detection of server underreporting, which should now be caught very early.)

Serious underreporting occurred from November 2009 through July 2010 and from August through October 2011.
Wherever possible the projectcounts files in these archives have been patched.

Underreporting could be corrected as follows:
The amount of underreporting could be deduced from average gaps between message sequence numbers
in the 1:1000 sampled squid log. When for some hour the weighted average gap between sequence numbers
for all squids was 1250 instead of 1000, this meant counts should be multiplied by 1250/1000 (that is, 1.25)
for that projectcounts file.

See also:
  http://svn.wikimedia.org/viewvc/mediawiki/trunk/wikistats/squids/SquidCountArchive.pl
    (collect gaps in sequence numbers)
  http://svn.wikimedia.org/viewvc/mediawiki/trunk/wikistats/dammit.lt/DammitPatchProjectcountsFromSquidStats.pl
    (patch projectcounts files)
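
The correction described under PATCHED FILES comes down to a small piece of arithmetic, sketched here in Python. This is an illustration under stated assumptions, not the logic of the Perl scripts referenced above: the function names are hypothetical, the weights used for the average (for example the number of sampled messages per squid) are a guess since this readme does not define them, and rounding to the nearest integer is likewise an assumption.

```python
def correction_factor(gaps, sample_rate=1000):
    """Return the factor by which one hour's counts should be scaled.

    gaps: list of (average_gap, weight) pairs, one per squid.
          The meaning of the weight is an assumption (see lead-in).
    sample_rate: expected gap in a 1:1000 sampled log.
    """
    total_weight = sum(weight for _, weight in gaps)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    weighted_gap = sum(gap * weight for gap, weight in gaps) / total_weight
    return weighted_gap / sample_rate


def patch_line(line, factor):
    """Scale the request count of one projectcounts line.

    The fourth column is set to 1, which is how patched lines are marked.
    Rounding to the nearest integer is an assumption of this sketch.
    """
    project, dash, requests, _size = line.split()
    patched_requests = round(int(requests) * factor)
    return f"{project} {dash} {patched_requests} 1"


# Example from the text: an average gap of 1250 instead of 1000.
factor = correction_factor([(1250, 1)])
print(factor)
print(patch_line("en - 8000 123456", factor))
```

With a single squid reporting an average gap of 1250, the factor is 1250/1000, so a count of 8000 becomes 10000.
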