Web Crawling for Varibute - A Case Study

Background:

Varibute required a comprehensive data extraction solution to gather commodity data from the website agmarket.gov.in. The project aimed to collect data on commodity prices and arrivals across states and markets to support Varibute’s business operations and decision-making. About 35 million records were expected in total.

Business Challenges:

Varibute faced several challenges in obtaining the necessary data: 

  • Data Volume: The need to extract and manage a vast amount of data (35 million records) posed significant logistical and technical challenges. 
  • Data Accuracy: Ensuring the accuracy and relevance of the extracted data was crucial for business operations. 
  • Efficiency: Manual data extraction was not feasible due to the volume and complexity of the data, necessitating an automated solution. 
  • Integration: The extracted data needed to be formatted and stored in a way that could be easily integrated into Varibute’s existing systems for analysis and reporting.

Solution Proposed:

To address these challenges, the following solution was proposed and implemented: 

  • Crawler Development: A crawler engine was developed using JavaScript and jQuery to automate the data extraction process. The crawler was designed to navigate the agmarket.gov.in website and extract the required data efficiently. 
  • System Deployment: Three systems were deployed in the office, each equipped with Chrome browsers and the necessary crawling extensions. 
  • Data Management: Extracted data was stored in the cloud and provided to the client in JSON format to facilitate easy access and integration. 
  • Continuous Communication: Regular updates on the status of the crawling activity were provided to the client via WhatsApp to ensure transparency and address any issues promptly.
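
The extraction and JSON delivery steps above can be illustrated with a minimal sketch. It is not the crawler that was built for the project. The column names, the table selector, and the helper names (rowToRecord, rowsToJson, collectRows) are assumptions made for illustration only, since the page layout of the source site is not described here.

```javascript
// Hypothetical column names; the real table layout is not given in this case study.
const COLUMNS = ["state", "market", "commodity", "arrivalDate", "arrivals", "modalPrice"];

// Convert one table row (an array of cell strings) into a plain record.
// Returns null when the cell count does not match the expected columns.
function rowToRecord(cells, columns = COLUMNS) {
  if (cells.length !== columns.length) {
    return null; // malformed row, for example a header or summary row
  }
  const record = {};
  columns.forEach((name, i) => {
    record[name] = cells[i].trim();
  });
  return record;
}

// Convert all rows, drop malformed ones, and serialize the result as JSON.
function rowsToJson(rows, columns = COLUMNS) {
  const records = rows
    .map((cells) => rowToRecord(cells, columns))
    .filter((record) => record !== null);
  return JSON.stringify(records);
}

// Browser-only helper: gather cell text with jQuery from a results table.
// The selector is hypothetical. jQuery's .map() flattens returned arrays by
// one level, so each row is wrapped in an extra array to keep rows separate.
function collectRows($, selector = "table tbody tr") {
  return $(selector)
    .map(function () {
      const cells = $(this)
        .find("td")
        .map(function () {
          return $(this).text();
        })
        .get();
      return [cells];
    })
    .get();
}
```

In a Chrome extension content script, a call such as rowsToJson(collectRows(jQuery)) would produce the JSON string to upload. Keeping the conversion separate from the DOM access means it can be checked without a browser.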

Technology Stack:

  • Programming Language and Libraries: JavaScript, jQuery 
  • Web Browser: Google Chrome with custom crawling extensions 
  • Data Storage: Cloud storage for maintaining commodity data, JSON for data access and integration 
  • Communication Tools: WhatsApp for real-time updates and client communication

Business Benefits:

  • Efficiency and Scalability: The automated crawler significantly reduced the time and effort required to extract large volumes of data. The solution was scalable, allowing Varibute to handle future increases in data volume without substantial additional investment. 
  • Data Accuracy and Consistency: Automated data extraction minimized the risk of human error, ensuring high accuracy and consistency in the collected data. 
  • Regular Updates: Frequent status updates to the client facilitated prompt issue resolution and ensured that the project stayed on track. 
  • Improved Decision-Making: Access to comprehensive and accurate commodity data enabled Varibute to make informed business decisions and optimize their operations. 
  • Cost Savings: Automation reduced the need for manual labor, resulting in significant savings of time and resources.