Developed an extensible, distributed, scalable, bandwidth-efficient crawler capable of scraping and extracting various entities from review sites, blogs, forums, social networks, etc. This was before the Scrapy project was started and while Nutch was in its infancy.
Tech: Python 2.6, BeautifulSoup, Python multiprocessing, MemcacheDB, PostgreSQL & Greenplum.
The crawler framework was designed so that data from any new site can be scraped easily by writing a small piece of Python code for that site. To add a scraper for a new site, the developer inherits from a base crawler class and overrides a fetch method with site-specific code. Data can also be fetched from APIs. The code for each site is loaded dynamically based on the URL currently being fetched.
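A minimal sketch of this plugin pattern (all names here are illustrative, not the original code): site scrapers subclass a base crawler, register themselves against a domain, and the framework picks the right class from the URL being fetched.

```python
# Sketch of the per-site scraper plugin pattern (hypothetical names).
from urllib.parse import urlparse

class BaseCrawler:
    """Base class; each site-specific scraper overrides fetch()."""
    def fetch(self, url):
        raise NotImplementedError

# Registry mapping a domain to its scraper class; in the real framework
# the module for a site would be loaded dynamically on first use.
SCRAPERS = {}

def register(domain):
    def decorator(cls):
        SCRAPERS[domain] = cls
        return cls
    return decorator

@register("example-reviews.com")
class ExampleReviewsCrawler(BaseCrawler):
    def fetch(self, url):
        # Site-specific parsing (e.g. with BeautifulSoup) would go here;
        # we return a placeholder record instead of real page content.
        return {"url": url, "source": "example-reviews.com"}

def crawler_for(url):
    """Select the scraper class based on the URL's domain."""
    return SCRAPERS[urlparse(url).netloc]()

page = crawler_for("http://example-reviews.com/item/1").fetch(
    "http://example-reviews.com/item/1")
```

Keeping the dispatch in a registry means adding a new site never touches the framework core: a new module registers its class and is picked up by URL.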
The worker processes can run on multiple machines, distributing the crawl across the network. Each crawl node fetches one URL at a time and passes the unstructured data to the next step in the pipeline.
The crawler scales horizontally by adding more crawl nodes. A single master node maintained the list of all URLs to be crawled and sent them to the crawl nodes.
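The master/crawl-node split can be sketched with the multiprocessing primitives the project used; this is an illustrative single-machine stand-in (placeholder fetch, hypothetical names), not the original distributed code.

```python
# Sketch: a master feeding URLs to crawl-node processes via queues.
from multiprocessing import Process, Queue

def crawl_node(url_queue, result_queue):
    """Worker loop: take one URL at a time until a None sentinel arrives."""
    while True:
        url = url_queue.get()
        if url is None:
            break
        # A real node would download the page here; emit a placeholder.
        result_queue.put((url, "<unstructured page data>"))

def master(urls, n_nodes=2):
    """Master: hold the URL list and hand URLs out to the crawl nodes."""
    url_q, result_q = Queue(), Queue()
    nodes = [Process(target=crawl_node, args=(url_q, result_q))
             for _ in range(n_nodes)]
    for node in nodes:
        node.start()
    for url in urls:
        url_q.put(url)
    for _ in nodes:
        url_q.put(None)          # one shutdown sentinel per node
    results = [result_q.get() for _ in urls]
    for node in nodes:
        node.join()
    return results

results = master(["http://site-a.example/1", "http://site-b.example/2"])
```

In the distributed version the queues would be backed by shared storage (e.g. MemcacheDB) rather than in-process pipes, but the one-URL-per-node handoff is the same.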
Each day the crawler fetches only the updates to each page. Based on the type of site being scraped, it maintains a history of previous crawls and jumps directly to the correct thread/page in a long run of paginated pages. This was especially useful for review sites and forums, which had a lot of pages.
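The incremental-crawl bookkeeping amounts to remembering how far each paginated thread was read; a hypothetical sketch (in the real system this history would live in MemcacheDB, not a dict):

```python
# Hypothetical sketch: remember the last page fetched per paginated thread
# so the next daily run skips straight to the new pages.
crawl_history = {}  # thread_url -> highest page number already crawled

def pages_to_fetch(thread_url, total_pages):
    """Return only the page URLs added since the previous crawl."""
    last_seen = crawl_history.get(thread_url, 0)
    new_pages = ["%s?page=%d" % (thread_url, p)
                 for p in range(last_seen + 1, total_pages + 1)]
    crawl_history[thread_url] = total_pages
    return new_pages

first = pages_to_fetch("http://forum.example/t/42", 3)   # fresh thread: all 3 pages
second = pages_to_fetch("http://forum.example/t/42", 5)  # next run: only pages 4-5
```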
After fetching the unstructured text from different pages, the crawler passes the data through a pipeline of entity extractors. The architecture is pluggable: the list of entities to be extracted is config-driven, based on the client/site being crawled. Entities include person names, organizations, locations, patent info, sentiment/opinion, biological and chemical entities, etc.
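The config-driven extractor pipeline can be sketched as a mapping from entity names to extractor functions, with the per-client config selecting which ones run (toy word-list extractors here stand in for the real NLP components; all names are illustrative):

```python
# Sketch of the pluggable, config-driven entity-extraction pipeline.
# Real extractors would be NLP components; these toy ones just match word lists.
EXTRACTORS = {
    "person":   lambda text: [w for w in text.split() if w in {"Alice", "Bob"}],
    "location": lambda text: [w for w in text.split() if w in {"Paris", "Tokyo"}],
}

def run_pipeline(text, config):
    """Apply only the extractors listed in the per-client/site config."""
    return {name: EXTRACTORS[name](text) for name in config["entities"]}

config = {"entities": ["person", "location"]}   # per-client config
out = run_pipeline("Alice met Bob in Paris", config)
```

Because extractors are looked up by name, a deployment for a new client is just a config change plus, at most, registering a new extractor function.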