Stories from ML engineers: Competitive Intelligence

This is a new post that discusses how AI is used in everyday life.

We collect stories from ML engineers who build real-world products and ask them about the nature of their work, the typical tasks of the role, and the specific technologies that help them do their jobs more effectively.

Competitive intelligence is of interest to companies that want to become more competitive, for example by comparing their prices for certain products with those of competitors. To do this, data is collected from every available open source: websites are scraped, price tags are photographed in brick-and-mortar stores, and so on.

The main task

Goods matching

Once the raw data has been collected and converted to text, the goods in the customer’s data need to be matched against the goods collected from open sources. This is where ML comes in.
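
To make the matching task concrete, here is a minimal non-ML baseline, not the approach used by the engineers in this story: it scores two product names by the Jaccard overlap of their character trigrams and picks the best candidate. The function names, the sample product strings, and the 0.3 threshold are all assumptions chosen for illustration.

```python
from __future__ import annotations


def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of a lowercased, whitespace-normalized string."""
    normalized = " ".join(text.lower().split())
    padded = f" {normalized} "  # pad so that word boundaries at the ends count too
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the character n-gram sets of two strings."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def best_match(query: str, candidates: list[str], threshold: float = 0.3):
    """Return (candidate, score) for the most similar candidate, or None.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    if not candidates:
        return None
    scored = [(candidate, jaccard(query, candidate)) for candidate in candidates]
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= threshold else None


if __name__ == "__main__":
    scraped = ["Pepsi Max 0.5 l", "Coca Cola Zero 0,5 L", "Fanta Orange 1 l"]
    print(best_match("Coca-Cola Zero 0.5l", scraped))
```

A baseline like this breaks down exactly where the subtasks below begin: mixed scripts, abbreviations, and meaningless entries.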

Typical subtasks

Extracting brand names

Part of the difficulty is that employees of different companies may write a brand name in either Latin or Cyrillic script, naturally without following any rules. One possible solution is to train a character-level LSTM model. The training data is easy to find: Wikipedia articles that describe borrowed terms. From these you can easily obtain a term in Russian together with its counterpart in the foreign language.
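
As a rough illustration of what "character-level" means in practice, the sketch below turns (Cyrillic, Latin) pairs into padded integer sequences of the kind a character-level LSTM could consume. The three pairs are hypothetical examples rather than data extracted from Wikipedia, the helper names are made up for this sketch, and the model itself is left out.

```python
PAD, SOS, EOS = "<pad>", "<sos>", "<eos>"


def build_vocab(strings):
    """Map every character seen in `strings` to an integer id.

    Ids 0, 1 and 2 are reserved for the padding, start and end tokens.
    """
    chars = sorted({ch for s in strings for ch in s})
    return {token: idx for idx, token in enumerate([PAD, SOS, EOS] + chars)}


def encode(text, vocab, max_len):
    """Encode `text` as <sos> + character ids + <eos>, right-padded to `max_len`."""
    ids = [vocab[SOS]] + [vocab[ch] for ch in text] + [vocab[EOS]]
    if len(ids) > max_len:
        raise ValueError(f"{text!r} does not fit into {max_len} positions")
    return ids + [vocab[PAD]] * (max_len - len(ids))


# Hypothetical training pairs: (Cyrillic spelling, Latin spelling).
pairs = [("найк", "nike"), ("адидас", "adidas"), ("самсунг", "samsung")]

src_vocab = build_vocab(src for src, _ in pairs)
tgt_vocab = build_vocab(tgt for _, tgt in pairs)

# +2 leaves room for the <sos> and <eos> tokens.
max_len = max(len(s) for pair in pairs for s in pair) + 2

encoded = [
    (encode(src, src_vocab, max_len), encode(tgt, tgt_vocab, max_len))
    for src, tgt in pairs
]
```

Each pair then becomes one training example: the model reads the source ids and learns to emit the target ids one character at a time.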

“Deciphering” abbreviations

Of course, when making records for internal use, employees tend to use all sorts of abbreviations. This problem, too, has to be solved with ML. Fortunately, in this case you can generate as much training data as you like: take text without abbreviations, shorten it arbitrarily, and train a seq2seq model to generate the “recovered” string.

Estimating the probability of “garbage”

Another “pleasant” gift from an organization’s employees is a database entry typed with someone’s ass =). As a result, you need to identify text that makes no sense at all. To solve this problem, perplexity is used: the inverse of the probability that a language model assigns to a given sequence, normalized by the sequence length. The higher the perplexity, the less the text resembles normal language.
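
A minimal sketch of the idea, under several assumptions: it uses a character-level bigram model with add-one smoothing instead of a word-level model (keyboard mashing shows up at the character level), and the tiny training corpus is invented for illustration. The story does not say which language model was actually used.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"


class CharBigramLM:
    """Character bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, corpus):
        self.bigrams = Counter()   # counts of (previous char, current char)
        self.contexts = Counter()  # counts of the previous char alone
        self.vocab = {BOS, EOS}
        for line in corpus:
            chars = [BOS] + list(line.lower()) + [EOS]
            self.vocab.update(chars)
            for prev, cur in zip(chars, chars[1:]):
                self.bigrams[prev, cur] += 1
                self.contexts[prev] += 1

    def prob(self, prev, cur):
        """Smoothed P(cur | prev); unseen characters share one extra slot."""
        size = len(self.vocab) + 1
        return (self.bigrams[prev, cur] + 1) / (self.contexts[prev] + size)

    def perplexity(self, text):
        """exp of the average negative log-probability per predicted symbol."""
        chars = [BOS] + list(text.lower()) + [EOS]
        log_prob = sum(math.log(self.prob(p, c)) for p, c in zip(chars, chars[1:]))
        return math.exp(-log_prob / (len(chars) - 1))


# Invented toy corpus of "normal" product records.
corpus = [
    "chocolate milk 1 l",
    "milk chocolate bar 100 g",
    "dark chocolate bar 90 g",
    "whole milk 2 l",
    "chocolate cookies 200 g",
]
lm = CharBigramLM(corpus)

normal_score = lm.perplexity("milk chocolate 1 l")
garbage_score = lm.perplexity("zxqjvkw qzxjkvw")
```

On this toy corpus the keyboard-mash string gets a higher perplexity than the product-like string, so a threshold on the score can flag suspicious entries; where to put that threshold would have to be tuned on real data.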