[Dev] mirrors needed

Michał Masłowski mtjm at mtjm.eu
Mon Mar 5 18:42:07 GMT 2012


>> In package transfer or any networking of the repo server?  Is the
>> increase caused by malicious bots crawling our site or more users often
>> updating their systems?
>> 
>> Having specific data on this would show if encouraging using different
>> mirrors can solve the problem.
>
> we anonymize ips on logs so user agents may give you an idea
>
> block the bots!

My script for processing the log is at [0].  The log file I used starts
and ends with these lines:

127.0.0.1 - - [04/Mar/2012:03:45:11 +0000] "GET /isos/i686/parabola-2011.09.01-core-i686.iso HTTP/1.0" 206 45260 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)" "-"
[...]
127.0.0.1 - - [05/Mar/2012:17:21:39 +0000] "GET /~lukeshu/os/x86_64/~lukeshu.db HTTP/1.1" 304 0 "-" "pacman/4.0.1 (Linux x86_64) libalpm/7.0.1" "-"

This is the output of my script on this log:

1107005 /skins Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/532.4 (KHTML, like Gecko) Qt/4.6.3 Safari/532.4
1336577 REPO Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20120207 Iceweasel/10.0
1787250 /pool Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
1911287 /pool Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
1924266 REPO Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
1938651 /index.php?title=Special:RecentChanges&feed=atom Mozilla/5.0 (X11; Linux x86_64; rv:10.0.2) Gecko/20120219 Thunderbird/10.0.2
2468956 /docs Wget
2651128 REPO SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
2662872 /other Mozilla/5.0 (compatible; Ezooms/1.0; ezooms.bot at gmail.com)
3568939 /index.php?title=Special:RecentChanges&feed=atom Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 1 subscribers; feed-id=17208844250925112595)
3840112 REPO curl
3976814 REPO Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)
4384389 /sources Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
6889936 /other curl
11755784 /other DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
11968356 REPO Axel 2.4 (Linux)
16766866 /isos DoCoMo/2.0 N905i(c100;TB;W24H16) (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
18488177 /other SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
19950376 /isos Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11
64949057 /isos SAMSUNG-SGH-E250/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1 UP.Browser/6.2.3.3.c.1.101 (GUI) MMP/2.0 (compatible; Googlebot-Mobile/2.1; +http://www.google.com/bot.html)
77276990 REPO Wget
123817802 REPO PackageKit
173015040 /isos Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
184574925 /isos Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.27) Gecko/20120216 Firefox/3.6.27
189289071 REPO Mozilla/5.0 (compatible; Ezooms/1.0; ezooms.bot at gmail.com)
215261516 REPO aria2/1.14.2
272698752 /isos Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0a2) Gecko/20120304 Firefox/12.0a2
277916052 /isos Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.53.11 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10
278714862 /isos Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)
280486126 /isos Mozilla/5.0 (X11; Linux i686; rv:10.0.2) Gecko/20100101 Firefox/10.0.2 Iceweasel/10.0.2
545259520 /isos Wget
650556670 REPO Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
1241702558 REPO Python-urllib
14725983912 REPO pacman
All bytes: 19451591920, bot bytes: 1153843961 (5.93%).

A trivial modification of the script sums /isos as 2115913496 (10.88%).
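
For readers who don't want to fetch [0], the kind of aggregation behind
these numbers can be sketched as below. This is not the script at [0];
it is a minimal illustration that assumes the log format of the two
sample lines above. The names LINE_RE, BOT_TOKENS, is_bot, top_level and
summarize are hypothetical, the bot token list is a guess, and the
grouping is simplified to the first path component (it does not
reproduce the REPO grouping shown in the output).

```python
import re
from collections import Counter

# Assumed log layout: ip - - [date] "METHOD path PROTO" status size "ref" "agent" "-"
LINE_RE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "\S+ (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) "[^"]*" "(?P<agent>[^"]*)"'
)

# Hypothetical heuristic: only self-identifying ("honest") bots are caught.
BOT_TOKENS = ("bot", "spider", "slurp", "feedfetcher")


def is_bot(agent):
    """Return True if the user agent string names itself as a crawler."""
    lowered = agent.lower()
    return any(token in lowered for token in BOT_TOKENS)


def top_level(path):
    """Return the first path component, e.g. '/isos' for '/isos/i686/x.iso'."""
    parts = path.split("/")
    return "/" + parts[1] if len(parts) > 1 else path


def summarize(lines):
    """Sum response bytes per (top-level path, user agent) pair.

    Returns (totals, all_bytes, bot_bytes); lines that do not match the
    assumed format, or that have no size, are skipped.
    """
    totals = Counter()
    all_bytes = 0
    bot_bytes = 0
    for line in lines:
        match = LINE_RE.match(line)
        if match is None or match.group("size") == "-":
            continue
        size = int(match.group("size"))
        agent = match.group("agent")
        totals[(top_level(match.group("path")), agent)] += size
        all_bytes += size
        if is_bot(agent):
            bot_bytes += size
    return totals, all_bytes, bot_bytes
```

summarize() accepts any iterable of lines, so an open log file can be
passed directly; the bot share is then 100 * bot_bytes / all_bytes.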

Things not shown by the script:

- lines with small data size – it would be too much to show, and it's
  mostly useless since the script lists each wiki article on a separate
  line

- the bot sum includes only honest bots, i.e. those that identify
  themselves instead of claiming to be MSIE or another browser

Things not logged:

- accesses outside March 4 and 5; I assume these dates are typical

- data other than the HTTP response size

My recommendations:

- add a /robots.txt file blocking all bots from anything on
  repo.parabolagnulinux.org

- remove ISO images unless there are users who cannot use torrents or
  other mirrors

- block bots that don't respect robots.txt by user agent if future logs
  show them generating significant traffic here

- promote using other mirrors

Unlike other sites, repo.parabolagnulinux.org doesn't need to be indexed
by search engines, so there should be no problem with blocking bots.
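
A robots.txt that blocks every bot from the whole host needs only two
lines. The sketch below holds that file as a string and checks it with
Python's standard urllib.robotparser; the ROBOTS_TXT constant and the
allowed() helper are hypothetical names used only for this
illustration. robots.txt is advisory, so it affects only crawlers that
choose to read it.

```python
from urllib.robotparser import RobotFileParser

# Proposed /robots.txt content: disallow everything for every user agent.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def allowed(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits the given agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    url = ("http://repo.parabolagnulinux.org/"
           "isos/i686/parabola-2011.09.01-core-i686.iso")
    for agent in ("Googlebot", "Baiduspider", "bingbot"):
        print(agent, allowed(agent, url))
```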

[0] https://mtjm.eu/patches/log_counter.py