Rebuilding a Publisher's Book Catalog

GitHub Repository

Contents of a Publisher’s Book Catalog

A publisher’s catalog holds all the data and metadata about its books. For example:

  • ISBN
  • Title
  • Author
  • Dimensions
  • Pages
  • Summary
  • Translator
  • Illustrator
  • Type of Cover
  • Image URL

This publisher has all of this information on its website, but it was uploaded manually, and there is no single file that contains it all.

Scraping the Data

To collect the data, I iterate over a list of ISBNs (International Standard Book Numbers) and use each one to locate the book's page on the website. I then extract the data from each book's page with Beautiful Soup and write the results to a CSV file using pandas.

If an ISBN or any of its data has errors, the ISBN is added to a list of problem ISBNs to be checked manually.
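
The loop described above can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: the class names in FIELD_CLASSES, the helper names parse_book_page and build_catalog, and the fetch_page callable are all assumptions, since the text does not describe the real page layout or how the pages are downloaded.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical mapping from catalog field to the CSS class that holds it.
# The real site's markup is not described in the text, so adjust as needed.
FIELD_CLASSES = {
    "title": "book-title",
    "author": "book-author",
    "pages": "book-pages",
}


def parse_book_page(html):
    """Extract the catalog fields from one book page; raise if one is missing."""
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for field, css_class in FIELD_CLASSES.items():
        node = soup.find(class_=css_class)
        if node is None:
            raise ValueError(f"missing field: {field}")
        record[field] = node.get_text(strip=True)
    return record


def build_catalog(isbns, fetch_page):
    """Return (DataFrame of scraped books, list of ISBNs to review by hand).

    fetch_page is any callable that maps an ISBN to the page's HTML,
    for example a thin wrapper around an HTTP client (assumed here).
    """
    rows, problems = [], []
    for isbn in isbns:
        try:
            record = parse_book_page(fetch_page(isbn))
        except Exception:
            # Any failure (page not found, missing field, bad ISBN)
            # sends the ISBN to the manual-review list.
            problems.append(isbn)
            continue
        rows.append({"isbn": isbn, **record})
    return pd.DataFrame(rows), problems
```

With a real fetch_page in place, the result could be saved with `catalog.to_csv("catalog.csv", index=False)`, where the file name is also just an example. Passing the fetcher in as an argument keeps the parsing logic testable without network access.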

Project link: https://github.com/dfmoscoso23/bookcatalog
