The headline “One line of code put 200 programmers at a big data company under arrest…” sent shock waves through the tech world. The culprit in this scandal is “data scraping”, a technique that typically uses bots to automatically harvest large amounts of data from pages on the internet.
The extracted data is often used for further analysis and has diverse use cases, such as:
- Search engine optimization
- E-commerce price monitoring
- Social media listening
- Content aggregation
- Simply enriching your existing database
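To make the idea of "harvesting data from a page" concrete, here is a minimal sketch of the extraction step for the price-monitoring use case. It is an illustration only: the HTML fragment, the `name` and `price` class names, and the `PriceParser` and `extract_prices` helpers are all hypothetical, and a real bot would first download the page over HTTP and would have to cope with far messier markup.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would download this over HTTP.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Kettle</span><span class="price">129.00</span></li>
  <li class="product"><span class="name">Toaster</span><span class="price">89.50</span></li>
</ul>
"""


class PriceParser(HTMLParser):
    """Collects (name, price) pairs from spans with class 'name' and 'price'."""

    def __init__(self):
        super().__init__()
        self.products = []   # finished (name, price) tuples
        self._field = None   # which span we are currently inside, if any
        self._name = None    # name seen for the product being read

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        text = data.strip()
        if not text or self._field is None:
            return
        if self._field == "name":
            self._name = text
        elif self._field == "price" and self._name is not None:
            self.products.append((self._name, float(text)))
            self._name = None

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def extract_prices(html):
    """Return the list of (name, price) tuples found in the given HTML text."""
    parser = PriceParser()
    parser.feed(html)
    return parser.products


if __name__ == "__main__":
    print(extract_prices(SAMPLE_HTML))
```

The sketch relies only on Python's standard library; production scrapers usually add a downloader, error handling and storage on top of a parsing step like this one.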
To this day, although data scraping is innocent in itself, many believe it operates in a grey area and can easily be put to malicious use, especially given its ability to extract highly personal and private information. Moreover, determining the legality of web scraping is far from easy in the digital era.
Back to our headline story: it all started when an ordinary programmer at a startup was asked by his supervisor to crawl a massive amount of data from the website of a large local company. It went smoothly at first, and the programmer kept optimizing the technique until it became so efficient that the heavy load of his illicit data retrieval eventually crashed the large company's server. This small incident lit the fuse: the angered company called in the police. The investigation quickly traced the source to a startup that sold CVs to anyone willing to pay the right price. But the startup was neither a recruitment platform nor a headhunting agency; its database of over 160 million CVs is said to have been partially acquired by scraping other major human resources websites. One day the police raided the startup and handcuffed about 200 programmers, who were stunned to learn that data scraping could put them behind bars.
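The crash in this story came from request volume, not from scraping as such. As a rough illustration, the sketch below shows one common way a crawler can cap its own request rate by enforcing a minimum pause between requests. The `Throttle` class and its parameters are hypothetical names introduced for this example; nothing here describes what the startup's code actually did.

```python
import time


class Throttle:
    """Enforces a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds required between requests
        self._clock = clock               # injectable so the class can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous request, if any

    def wait(self):
        """Sleep until min_interval has passed since the last call.

        Returns the number of seconds slept (0.0 if no pause was needed).
        """
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        self._last = self._clock()
        return delay


if __name__ == "__main__":
    throttle = Throttle(min_interval=0.2)
    for page in range(3):
        throttle.wait()  # a real crawler would fetch the page after this call
        print("fetching page", page)
```

Calling `wait()` before every request keeps the crawler at or below one request per `min_interval` seconds, however fast the rest of the code runs. It does nothing to make unauthorized scraping lawful.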
A chain of events followed as regulators moved to crack down on illegally scraping and obtaining private data online. Several big data startups, especially Fintech-related ones, were reportedly soon drawn into scrutiny and questioning as well. Major banks reported to the police that Fintech startups of this type often collect their important user information, again by scraping without the banks' permission.
True to form, a big data company has to wield some scraping power one way or another. Without scraping, you are like someone going to a formal dinner shirtless. But today, with stringent regulation on the horizon, a few startups have even decided to wash their hands of it.
One Chinese startup, which specializes in running NLP analysis on unstructured text data to generate consumer insights or customer feedback, might be an extreme example. The magic of all its analysis is essentially fed by the blood of scraped open, public user-generated content, in particular comments from Alibaba's e-commerce sites, the social media site Little Red Book, the online travel site Ctrip, etc. In the past it built a solid internal crawling/scraping division to pump blood from the internet every day; now, however, it asks its clients to bring their own data instead of scraping on their behalf. If clients insist on scraping, it says it has to consult lawyers first to avoid any negative consequences. As harmless as this open data might seem on the surface, the startup said it is too small to embroil itself in any legal trouble. Not to mention that Alibaba is well known to have set up anti-scraping mechanisms to prevent outsiders from abusively looting its open data.
Having said that, not every startup would chicken out. One local startup, which offers e-commerce intelligence monitoring covering pricing, competitor stores, sales value, sales volume, reviews, etc., still feels comfortable scraping data from leading e-commerce channels like Alibaba, JD or Pinduoduo. It can update its intelligence data on an almost hourly basis. According to the startup, the huge daily traffic on these e-commerce platforms perfectly disguises its data scraping as normal website visits.
Today data scraping in China is a well-paying job, with a monthly salary between RMB 10K and 60K depending on your skills. You can always find underground shops to scrape raw data for you. But big data startups selling analysis probably need to tread more cautiously from now on.