System Design
Web Crawler
Design a web crawler that extracts data from web pages and stores it in a database. The system must handle large volumes of data and scale as that volume grows.
Example:
Suppose we need to build a system that extracts product information, including names, prices, and descriptions, from an e-commerce website. The system must crawl 10,000 pages and store the extracted data in a database for further analysis.
Functional Requirements:
- The system should crawl web pages given a list of URLs as input.
- The system should extract data from web pages using regular expressions or other data extraction techniques.
- The system should store the extracted data in a database for further analysis (a minimal pipeline sketch follows this list).
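At its core the pipeline is fetch, extract, store. Below is a minimal single-threaded sketch, assuming pages are fetched over HTTP with the requests library and results land in SQLite; the regex patterns, URL list, and products table schema are illustrative stand-ins, not part of the requirements above.

```python
import re
import sqlite3

import requests

# Illustrative patterns; real ones depend on the target site's markup.
NAME_RE = re.compile(r'<h1 class="product-name">(.*?)</h1>')
PRICE_RE = re.compile(r'<span class="price">\$([\d.]+)</span>')

def crawl(urls, db_path="products.db"):
    """Fetch each URL, extract name and price, and persist them to SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products (url TEXT PRIMARY KEY, name TEXT, price REAL)"
    )
    for url in urls:
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # skip failed pages here; retries are handled later
        name = NAME_RE.search(resp.text)
        price = PRICE_RE.search(resp.text)
        if name and price:
            conn.execute(
                "INSERT OR REPLACE INTO products VALUES (?, ?, ?)",
                (url, name.group(1), float(price.group(1))),
            )
            conn.commit()
    conn.close()
```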
Non-functional Requirements:
- The system should handle a large volume of data, with at least 10,000 pages to crawl.
- The system should scale horizontally to handle larger volumes of data.
- The system should be fault-tolerant and recover from errors such as timeouts and failed requests.
- The system should run continuously with minimal downtime.
- The system should issue a high volume of requests without degrading the performance of the web servers being crawled (see the politeness sketch after this list).
- The system should extract data from web pages within a reasonable time frame.
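One way to meet the fault-tolerance and politeness requirements is a fetcher that checks robots.txt, spaces out requests to the same host, and retries with exponential backoff. This is a sketch under assumed values: the one-second delay, three retries, and the user-agent string are illustrative choices, not figures from the requirements.

```python
import time
import urllib.robotparser
from urllib.parse import urlparse

import requests

USER_AGENT = "example-crawler/1.0"  # hypothetical identifier
CRAWL_DELAY = 1.0  # seconds between requests to the same host (assumed)

_robots_cache = {}
_last_hit = {}

def allowed(url):
    """Check robots.txt for the URL's host, caching the parsed rules."""
    host = urlparse(url).netloc
    if host not in _robots_cache:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"https://{host}/robots.txt")
        try:
            rp.read()
        except OSError:
            rp = None  # unreachable robots.txt: treat as allowed (a policy choice)
        _robots_cache[host] = rp
    rp = _robots_cache[host]
    return rp is None or rp.can_fetch(USER_AGENT, url)

def polite_get(url, retries=3):
    """Fetch a URL with a per-host delay and exponential backoff on failure."""
    if not allowed(url):
        return None
    host = urlparse(url).netloc
    wait = CRAWL_DELAY - (time.monotonic() - _last_hit.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)
    for attempt in range(retries):
        _last_hit[host] = time.monotonic()
        try:
            resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
            if resp.status_code < 500:
                return resp  # success, or a client error not worth retrying
        except requests.RequestException:
            pass
        time.sleep(2 ** attempt)  # back off before the next attempt
    return None
```

In a horizontally scaled deployment, the per-host timestamps and robots cache would move from these process-local dictionaries into shared state such as Redis, so all workers observe the same rate limits.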
Assumptions:
- The system assumes that web pages being crawled are publicly accessible and do not require authentication.
- The system assumes that web pages being crawled do not require JavaScript to render.
- The system assumes that web pages being crawled are in a format that can be parsed with regular expressions or other data extraction techniques.
Estimated Usage:
- The system is designed to crawl at least 10,000 web pages.
- The system should handle at least 1,000 requests per minute (a rough capacity check follows this list).
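As a rough capacity check: 1,000 requests per minute is about 17 requests per second. Assuming an average fetch latency of roughly 600 ms (an illustrative figure, not stated above), about 17 × 0.6 ≈ 10 concurrent workers sustain that rate, and 10,000 pages complete in around 10 minutes. A minimal worker-pool sketch, reusing a fetcher like polite_get above:

```python
from concurrent.futures import ThreadPoolExecutor

# 1,000 requests/minute ≈ 17 requests/second; at an assumed ~600 ms per
# fetch, about 17 * 0.6 ≈ 10 concurrent workers sustain the target rate.
NUM_WORKERS = 10

def crawl_all(urls, fetch):
    """Fan the URL list out to a fixed-size pool of fetch workers."""
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        return list(pool.map(fetch, urls))
```

A single process at this rate is enough for the 10,000-page target; scaling beyond it means partitioning the URL list across machines, which is where the horizontal-scaling requirement comes in.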