Guest post by David Vizi, CTO @ emzio
We develop emzio (https://emzio.co), a web application that helps Amazon sellers find profitable products to sell. Sellers request product lists from their suppliers and run them through emzio on a regular basis. emzio then enriches the data with relevant metrics and presents it in an easy-to-filter way. The product lists are usually Excel sheets, less often CSV files, with information about each product such as barcode and wholesale price.
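To give you an idea of the input, a supplier list boils down to something like this minimal Python sketch of reading one; the file name and column names are made up for illustration, since every supplier uses their own.

```python
import csv

# Hypothetical supplier product list; real column names and formats
# vary from supplier to supplier.
with open("supplier_list.csv", newline="") as f:
    for row in csv.DictReader(f):
        barcode = row["barcode"]                        # e.g. an EAN-13 code
        wholesale_price = float(row["wholesale_price"])
        print(barcode, wholesale_price)
```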
The problem is that many suppliers can't or don't want to send such lists. Instead, they point sellers to their wholesale webshop. The reason is usually technical: sales managers are often not tech-savvy, and if they don't already maintain their inventory in a ready-to-send format they resort to advertising their web portal. Maybe we should introduce them to CRANQ?
emzio needs well-structured data to work, so these suppliers are disqualified from our business model.
If only there were a way to turn webshops into CSV files...
Although there are already mature solutions for website scraping, we decided to give CRANQ a shot and see how it performs. I'd never used any of these solutions to any extent worth mentioning, so some kind of learning curve was a given anyway. If I'm going to learn something new, why not learn something completely new?
Now, before we proceed, let me assure you of one thing: this procedure is desirable for suppliers. After all, it's in their best interest to sell their inventory, isn't it? We always employ proper throttling so as not to turn our endeavor into a denial-of-service attack.
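CRANQ programs are visual, so I can't paste the graph here, but the throttling idea is nothing more than what this rough Python sketch shows; the delay value and URL are placeholders, not our production settings.

```python
import time
import urllib.request

# Illustrative throttling only: wait between requests so the supplier's
# webshop never sees more than a trickle of traffic.
DELAY_SECONDS = 2.0

def fetch_politely(urls):
    pages = []
    for url in urls:
        with urllib.request.urlopen(url) as response:
            pages.append(response.read())
        time.sleep(DELAY_SECONDS)  # pause before the next page load
    return pages

# pages = fetch_politely(["https://portal.example.com/products?page=1"])
```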
Before I give you my honest evaluation of CRANQ, let me describe my background. Backgrounds are important because CRANQ is different.
My background is mainly in imperative languages. Of course, I know about other paradigms and got my feet wet with Haskell and J, but the way I see problems has been largely shaped by development in imperative languages. This puts me in a good place to evaluate CRANQ from a typical software developer's perspective, as most development still happens in imperative languages.
Let's also not forget that the experiment was done quite some time ago, at a much more rudimentary stage of the project, when features like instrumentation didn't exist and we used the console for debugging.
I'm not going to lie: the learning curve was steep. The documentation and search functionality weren't as good as they are today. And then there was the literal paradigm shift my brain had to undergo to tackle the problem. When everything is asynchronous by default, you know you'll have to invert your perspective, too. I had to go back to the CRANQ team a few times to ask how certain problems are best solved. It often turned out that CRANQ could do things out of the box that I thought I would have to implement from scratch. Nevertheless, I did identify a few nice-to-haves which are now part of CRANQ (like the XPath evaluator node).
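To show what the XPath evaluator node does conceptually, here is a rough equivalent in Python using lxml; the markup and selectors are invented for the example, and real portals differ.

```python
from lxml import html

# Invented product-tile markup; a real supplier portal looks different.
page = html.fromstring("""
<div class="product">
  <span class="ean">5901234123457</span>
  <span class="price">12.40</span>
</div>
""")

# The same kind of expressions the XPath evaluator node takes in CRANQ.
eans = page.xpath('//div[@class="product"]/span[@class="ean"]/text()')
prices = page.xpath('//div[@class="product"]/span[@class="price"]/text()')
print(list(zip(eans, prices)))  # [('5901234123457', '12.40')]
```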
With some experimentation and the help of the team, I finally finished the first scraper. Then I created a second one. The second one required less help from the team thanks to my growing experience, the improved IDE functionality, and better tutorials. If you're starting your CRANQ journey now, it will be much easier for you.
All in all, CRANQ proved to be a more than adequate tool for the job. What wasn't already supported could be added with ease. Traversing the HTML and adding login functionality were no-brainers. If you're a developer, CRANQ is worth learning just to expand your horizons. As for non-devs, I think that for those completely new to programming, whose way of thinking hasn't yet been derailed by machine-friendly (and rather human-unfriendly) imperative languages, CRANQ is a better introduction to programming than, say, Java.
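For readers who want to picture the login part, here is roughly what it amounts to, sketched in Python with placeholder URLs and form fields; CRANQ handles the same flow with its own nodes.

```python
import requests

# Placeholder URLs and form fields; every wholesale portal names these
# differently, so treat this as a shape, not a recipe.
session = requests.Session()
session.post(
    "https://portal.example.com/login",
    data={"username": "buyer@example.com", "password": "secret"},
)

# The session keeps the authentication cookie, so subsequent page
# fetches behave as a logged-in buyer.
catalogue = session.get("https://portal.example.com/products?page=1")
print(catalogue.status_code)
```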
I've made two fully functional scrapers and plan to create more in the near future. Supplier portals are very similar, so the problem domain is small and well-defined. I want to extract a mini-framework that will help me write new scrapers with minimal effort. I'm at the point where using a conventional scraper would just slow me down.
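The mini-framework doesn't exist yet, but the idea, sketched here in Python rather than CRANQ, is a small per-supplier configuration driving a shared scrape loop; all names and selectors below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class SupplierConfig:
    """Everything that differs between supplier portals."""
    base_url: str
    product_xpath: str
    barcode_xpath: str
    price_xpath: str
    delay_seconds: float = 2.0

# Adding a new supplier would then mean writing one config, not one
# scraper. All values here are made up for illustration.
acme = SupplierConfig(
    base_url="https://portal.example.com",
    product_xpath='//div[@class="product"]',
    barcode_xpath='.//span[@class="ean"]/text()',
    price_xpath='.//span[@class="price"]/text()',
)
```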