ABOUT CLIENT

The client is a US-based company that runs an online directory portal. They were working on some unique enhancements to make the portal stand out, and their main concern was sourcing the data for the directory. The portal is meant to list businesses across different regions, and the goal was to get the maximum number of profiles published within a limited period of time.

APPROACH

The company approached ECIT, and we discussed the possibility of getting genuine data from different sites. Scraping was the only practical option. With the client's help, we researched yellow-pages sites that list businesses in the major cities and picked the 10 most promising ones.

Choosing the right technology was the next step. After some research, we found Python to be the best fit for the job, together with the tools Splinter and Selenium. Splinter is a Python abstraction layer on top of existing browser automation tools such as Selenium. In our experience, Splinter handled some automation steps where Selenium alone could not identify the page elements.
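As an illustration only, a Splinter-based extraction step might look like the sketch below. It is not the script ECIT ran: the CSS selectors (`.listing`, `.name`, `.address`), the function names, and the choice of Firefox are hypothetical placeholders, and every target site would need its own selectors.

```python
def extract_listings(browser, url, card_css=".listing",
                     name_css=".name", address_css=".address"):
    """Visit a results page and pull the name and address from each listing card.

    `browser` is expected to behave like a Splinter browser (visit, find_by_css).
    The default selectors are hypothetical and must be adapted per site.
    """
    browser.visit(url)
    listings = []
    for card in browser.find_by_css(card_css):
        names = card.find_by_css(name_css)
        addresses = card.find_by_css(address_css)
        # Skip cards that are missing either field instead of failing.
        if len(names) == 0 or len(addresses) == 0:
            continue
        listings.append({
            "name": names[0].text.strip(),
            "address": addresses[0].text.strip(),
        })
    return listings


def scrape_with_splinter(url):
    """Open a real browser through Splinter and scrape a single page."""
    # Third-party package; imported here so the helper above has no hard dependency.
    from splinter import Browser

    browser = Browser("firefox")  # assumption: Firefox and its driver are installed
    try:
        return extract_listings(browser, url)
    finally:
        browser.quit()
```

Keeping the extraction logic separate from browser start-up means the same function can be exercised with a stand-in browser object before it is pointed at a live site.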

 

CLIENT REQUIREMENTS


The client required some unique enhancements to make their business portal stand out in the market. The main challenge was to gather the data for their directory, covering the information available for different regions, and to get the maximum number of profiles published within a limited time period.

The data was scattered across many yellow-pages sites and needed to be pulled into the directory portal: mainly the business name, address, geolocation, description, images, etc.

REASON FOR ENGAGEMENT

Time was a major concern for our client; they found it difficult to populate business listings from all over the internet. The normal registration procedure, which picks and enrols businesses into the system, was not efficient, and the cost of building and marketing the portal was sky high.

RESULT

The result surprised the client. The process, initially estimated at one month, was completed within 2 weeks. We scraped almost 99% of the data from all 10 sites. Each record was inserted directly into the client's beta site.

STRATEGY

ECIT started working on all 10 selected sites simultaneously. The rough target was to populate one million entries in a month. We purchased a server and started scripting the automation process. We faced some challenges at the start, such as keeping duplicate entries out of the database. We created triggers to identify and remove unwanted listings and to block similar insertions in future. The actual scraping runs started within 8 days. Because scraping is slow and depends on the speed of the site being scraped, we ran 50 browser sub-processes in parallel for each site.
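The two ideas in this strategy, a database trigger that rejects duplicates and a pool of parallel workers, can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the production setup: it uses an in-memory SQLite database, threads instead of browser sub-processes, and a stub `scrape_page` function returning made-up sample rows. The table, trigger, and function names are hypothetical.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

SCHEMA = """
CREATE TABLE listings (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    address TEXT NOT NULL,
    source TEXT NOT NULL
);
-- Skip any new row whose normalized name and address already exist.
CREATE TRIGGER skip_duplicate_listing
BEFORE INSERT ON listings
WHEN EXISTS (
    SELECT 1 FROM listings
    WHERE lower(trim(name)) = lower(trim(NEW.name))
      AND lower(trim(address)) = lower(trim(NEW.address))
)
BEGIN
    SELECT RAISE(IGNORE);
END;
"""


def scrape_page(page_number):
    """Stand-in for one browser worker; returns made-up sample rows."""
    source = f"page-{page_number}"
    return [
        ("Acme Bakery", "12 Main St", source),  # same business on every page
        (f"Shop {page_number}", f"{page_number} Oak Ave", source),
    ]


def run_scrape(pages, workers=50):
    """Scrape pages in parallel and insert the rows from the main thread."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for rows in pool.map(scrape_page, pages):
            for row in rows:
                # The trigger silently drops rows that are already present.
                conn.execute(
                    "INSERT INTO listings (name, address, source) VALUES (?, ?, ?)",
                    row,
                )
    conn.commit()
    return conn
```

Doing the duplicate check inside the database means every worker can insert blindly, and the rule keeps applying to any later insert as well.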

KEY BENEFITS TO OUR CLIENT

17 lakh (1.7 million) business profiles scraped

Profiles populated within 2 weeks

Insertion restricted to genuine business profiles