Web Data Extraction

This project provides a simple web crawler and scraper for extracting data from websites. The crawler is built on Scrapy, and the scraper uses BeautifulSoup. Currently, the scraper extracts structured data only from Trustpilot.com; for all other sites it returns unstructured text.

Current features

  • [✅] extract_from_url API:

    • This API extracts text from a list of arbitrary website URLs, using the trafilatura library for text content extraction.
    • It can follow links recursively from each starting page while staying within the same domain. The max_next_pages parameter controls this: 0 extracts only the given URL, and a higher number allows navigation through that many further pages.
  • [✅] extract_from_trustpilot_url API:

    • This API extracts formatted content from a specific organization's page on Trustpilot.
    • It can also navigate through subsequent pages to gather more reviews, controlled by the max_next_pages parameter.
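
The same-domain navigation described above can be illustrated with a minimal sketch. This is not the project's implementation (which is built on Scrapy); it uses only the Python standard library, and the names crawl_same_domain and fetch are hypothetical. It also assumes that max_next_pages counts the pages visited in addition to the starting URL.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict, List
from urllib.parse import urldefrag, urljoin, urlparse


class _LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_same_domain(
    start_url: str,
    fetch: Callable[[str], str],
    max_next_pages: int = 0,
) -> Dict[str, str]:
    """Breadth-first crawl that never leaves the domain of start_url.

    `fetch` is a hypothetical callable mapping a URL to its HTML.
    Assumption: max_next_pages=0 visits only start_url, and N visits
    at most N additional pages.
    """
    domain = urlparse(start_url).netloc
    pages: Dict[str, str] = {}
    queue = deque([start_url])
    seen = {start_url}

    while queue and len(pages) < max_next_pages + 1:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html

        parser = _LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)

    return pages
```

Passing the fetcher as an argument keeps the sketch testable without network access. In practice, a fetcher could wrap trafilatura.fetch_url, and the text of each returned page could then be obtained with trafilatura.extract.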

Project link: https://github.com/sarapiscitelli/web-data-extraction
