Result: Improvisation of Crawling on Web Maps Using Distributed Algorithms with Socket Programming Model
This research develops a distributed web crawler that uses socket-based communication and a master-slave architecture to improve the efficiency of data collection across multiple devices. A tracker manages peer connections, including devices with public and private IP addresses, and uses UDP hole punching to enable peer-to-peer communication between nodes behind NATs. A load balancer distributes URLs evenly among crawler nodes, minimizing duplication and balancing the workload. The architecture consists of a tracker, a manager, and multiple clients; clients on private IP addresses perform the actual crawling, which helps mitigate restrictions such as rate limiting. In a one-hour test starting from the seed URL "https://www.detik.com/", the distributed crawler collected 9175 unique URLs with no duplicates, roughly a 30% increase over the 7069 URLs collected by a single crawler. The system improves resource efficiency and avoids redundant collection, although its performance is affected by network latency and coordination overhead. Overall, the distributed approach significantly improves crawling throughput and scalability compared to a single-device crawler.
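To make the architecture concrete, the following is a minimal sketch, written in Python for illustration and not taken from the authors' implementation, of the manager side of the master-slave design described above: the manager keeps a URL frontier and a global seen-set for deduplication, and hands URLs to crawler clients over TCP sockets on demand, which keeps the nodes roughly evenly loaded. The host, port, and line-delimited JSON message format are assumptions made for this example, and the tracker and UDP hole punching used for NAT traversal are omitted.

    # Hypothetical manager sketch: frontier + dedup set + per-client worker threads.
    import json
    import socket
    import threading
    from collections import deque

    HOST, PORT = "0.0.0.0", 9000        # assumed listen address for the manager
    SEED = "https://www.detik.com/"     # seed URL used in the reported test

    frontier = deque([SEED])            # URLs waiting to be crawled
    seen = {SEED}                       # global dedup set, prevents duplicate work
    lock = threading.Lock()

    def handle_client(conn: socket.socket) -> None:
        """Serve one crawler node: send it a URL, receive the URLs it discovered."""
        with conn, conn.makefile("rwb") as stream:
            while True:
                with lock:
                    url = frontier.popleft() if frontier else None
                if url is None:
                    # Frontier is empty: tell the client to back off and reconnect later.
                    stream.write(b'{"cmd": "wait"}\n')
                    stream.flush()
                    return
                # One JSON object per line is the assumed wire format.
                stream.write(json.dumps({"cmd": "crawl", "url": url}).encode() + b"\n")
                stream.flush()
                reply = stream.readline()
                if not reply:
                    return              # client disconnected
                found = json.loads(reply).get("found", [])
                with lock:
                    for u in found:     # enqueue only URLs never seen before
                        if u not in seen:
                            seen.add(u)
                            frontier.append(u)

    def serve() -> None:
        """Accept crawler clients; each connection gets its own worker thread."""
        with socket.create_server((HOST, PORT)) as server:
            while True:
                conn, _addr = server.accept()
                threading.Thread(target=handle_client, args=(conn,), daemon=True).start()

    if __name__ == "__main__":
        serve()

A matching client would connect to the manager, read one JSON command per line, fetch and parse the assigned page, and reply with the URLs it found (for example {"found": ["https://www.detik.com/news"]}). In this sketch the manager's shared seen-set is what keeps two clients from crawling the same URL, which is consistent with the zero-duplication result reported above.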