Keyword - database


Monday, April 28 2008

Favicon Archive Update #1

I recently added a dedicated server to the Favicon Archive project. It will handle two main duties: running a fast PostgreSQL 8.3 DBMS with 1 GB of physical memory on board, and eventually plugging into the main Favicon server as a file server over NFS.

The new dedicated DB server is incomparably faster than the shared one (1 GB of RAM vs. 256 MB) and allowed me to spawn many more crawler processes. I'm currently running 34 instead of the previous 6.

Side effect: hard drive storage will run low much sooner :D That's why I'm thinking about sharing the new server's hard drive, which will hopefully add about 200 GB to the file archive.

Here is a summary of the Favicon Archive after one month of operation:

  • Favicon files: 4.3 GB
  • Cache files: 41.0 GB
  • Database size: 3.0 GB

Crawler stats (5-minute averages, 1-month history):
  • Average homepages fetched: 237
  • Average favicons saved: 122
  • Total sites discovered: 4 M
  • Total icons saved: 728 K
This is about 1 favicon found for every 2 websites fetched and 1 favicon for every 5.5 sites discovered.
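
For anyone who wants to check those two ratios, here is a minimal sketch that recomputes them from the figures listed above. The variable names are mine and purely illustrative; only the numbers come from the stats.

```python
# Figures quoted in the crawler stats; variable names are illustrative only.
homepages_per_5min = 237      # average homepages fetched per 5 minutes
favicons_per_5min = 122       # average favicons saved per 5 minutes
sites_discovered = 4_000_000  # total sites discovered after one month
icons_saved = 728_000         # total icons saved after one month

# How many homepages are fetched for each favicon actually saved.
fetched_per_favicon = homepages_per_5min / favicons_per_5min

# How many sites are discovered for each favicon actually saved.
discovered_per_favicon = sites_discovered / icons_saved

print(f"{fetched_per_favicon:.2f} homepages fetched per favicon")
print(f"{discovered_per_favicon:.2f} sites discovered per favicon")
```

The first ratio rounds to 2 and the second to 5.5, which matches the figures above.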

At this rate, the system will collect about 8.7M favicons over the next year (generating 51 GB of icon data and about 486 GB of cache files). This will result in about 48M websites in the database, and I will eventually reach my first billion websites discovered in about 21 years (for about 1 TB of icons and 11 TB of cache files) :D
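
The projection is a plain linear extrapolation of the first month. Here is a rough sketch of it, assuming every coming month looks exactly like the first one (a big assumption). The function and variable names are made up for the illustration.

```python
MONTHS_PER_YEAR = 12

# First-month totals quoted in the summary (rounded figures).
icons_saved = 728_000
sites_discovered = 4_000_000
icon_gb = 4.3
cache_gb = 41.0


def project(per_month, months):
    """Linear extrapolation: assume each month repeats the first one."""
    return per_month * months


icons_year = project(icons_saved, MONTHS_PER_YEAR)
sites_year = project(sites_discovered, MONTHS_PER_YEAR)
icon_gb_year = project(icon_gb, MONTHS_PER_YEAR)
cache_gb_year = project(cache_gb, MONTHS_PER_YEAR)

# Years needed to reach one billion discovered sites at that yearly pace.
years_to_billion = 1_000_000_000 / sites_year

print(f"icons after one year: {icons_year:,}")
print(f"sites after one year: {sites_year:,}")
print(f"icon data after one year: {icon_gb_year:.1f} GB")
print(f"cache after one year: {cache_gb_year:.1f} GB")
print(f"years to one billion sites: {years_to_billion:.1f}")
```

With these rounded inputs the cache figure comes out slightly above the 486 GB quoted, presumably because the quoted value was computed from unrounded sizes. The other results line up with the numbers in the post.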

Still at this rate, I will run out of space and have to shut down all the crawlers in about 3 months and 3 weeks.


Add your site before closure!



More seriously, I'll try to implement some interesting features, such as:
  • Homepage ownership via a special meta tag that will allow you to edit your keywords, tags and description.
  • Icon search by color.
  • A community and voting system.


  

Tuesday, April 15 2008

Introducing : moBlur's Favicon Archive Project

All hail the world's first Favicon Archive!


In the next few days you may notice yet another crawler visiting your site. Identified by the user agent "FaviconArchiver/1.0 (+http://moblur.org/workshop/favicon_archive/)", it will fetch your homepage (and only this page, unless redirected) and save your favicon to moBlur's dedicated database. Once crawled, if your site has a favicon, it will appear on our index page and in the search engine. A dedicated page will also be directly accessible, displaying a summary of your site based on its meta tags and a link to your domain.

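To give an idea of what "fetch the homepage and save the favicon" involves, here is a small Python sketch of the favicon lookup step. It is not the actual crawler (which is written in PHP 5): the class and function names are hypothetical, and the fallback to /favicon.ico is a common browser convention that I am assuming here, not a documented behavior of FaviconArchiver.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request

# User agent string announced by the crawler.
USER_AGENT = "FaviconArchiver/1.0 (+http://moblur.org/workshop/favicon_archive/)"


class IconLinkParser(HTMLParser):
    """Remembers the href of the first <link> whose rel contains "icon"."""

    def __init__(self):
        super().__init__()
        self.icon_href = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.icon_href is not None:
            return
        attr = dict(attrs)
        # rel may be "icon" or "shortcut icon"; compare case-insensitively.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "icon" in rel_tokens and attr.get("href"):
            self.icon_href = attr["href"]


def homepage_request(url):
    """Builds (but does not send) a request carrying the crawler's user agent."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def favicon_url(homepage_url, html):
    """Returns the absolute favicon URL declared in the page.

    Falls back to /favicon.ico at the site root (assumed convention)
    when the page declares no icon link.
    """
    parser = IconLinkParser()
    parser.feed(html)
    return urljoin(homepage_url, parser.icon_href or "/favicon.ico")
```

For example, favicon_url("http://example.com/blog/", page_html) resolves a relative href such as "img/fav.ico" against the homepage URL, so the icon can then be downloaded with a second request.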

This project was born from curiosity about large database-driven web applications. How to scale, how to optimize, and how to deal with a huge database and a gigantic number of filesystem entries were the main questions I was asking myself.


The project itself is far from being optimal.

  • The crawler, written in PHP 5, could be better and faster... in fact, it could have been written in C++...
  • The database system, recently upgraded from PostgreSQL 8.1 to 8.3, still needs some fine-tuning (thanks, Phil, for your precious help and advice on this :D).

So far (a couple of weeks), so good: the crawler has discovered 2.5M unique domains (subdomains count too) and saved half a million favicons on the filesystem (ext3 on Debian stable).

Current room for the application:

  • CPU 1.20 GHz
  • RAM 256 MB
  • HDD 144 GB
All dedicated (well, almost dedicated) to the Favicon Archive.

My goal now is to optimize the application to keep it inside this nutshell for the longest time possible.