It is the process of parsing a piece of text and discovering the named entities it contains. Entities can be of different types, such as countries, companies, people, and so on.
This app uses a fine-tuned model: an OpenAI GPT model trained on a thousand examples of prompt/completion pairs, where the prompt is an article and the completion is a list of entities together with their Wikipedia article URLs.
The training dataset is based on actual Wikipedia articles and URLs, so we benefit from the work of hundreds, if not thousands, of contributors who have determined which entities appear in a given article. You can learn more about how to use fine-tuning to create an entity extraction app if you're interested.
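To make the prompt/completion structure concrete, here is a minimal sketch of what a single training example might look like in the JSONL format used for OpenAI fine-tuning. The separator (`###`), the stop sequence (`END`), the pipe-separated completion layout, and the file name are illustrative assumptions, not the app's actual training data.

```python
import json

# A minimal sketch of one training example in the legacy OpenAI fine-tuning
# format (one JSON object per line of a .jsonl file). The article text and
# the entity list are illustrative, not the app's actual training data.
example = {
    "prompt": (
        "Galileo Galilei was a great scientist. He made great discoveries, "
        "one of them was about the planet Jupiter.\n\n###\n\n"
    ),
    "completion": (
        " Galileo Galilei | https://en.wikipedia.org/wiki/Galileo_Galilei\n"
        "Jupiter | https://en.wikipedia.org/wiki/Jupiter\n\nEND"
    ),
}

# Write the examples (here only one) to a JSONL file ready for fine-tuning.
with open("entity_extraction_training.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(example) + "\n")
```

Given a short piece of text like the one below, the fine-tuned model then returns a table of entities together with their Wikipedia URLs: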
Galileo Galilei was a great scientist. He made great discoveries, one of them was about the planet Jupiter.
| Entity | Wikipedia URL |
|---|---|
| Galileo Galilei | https://en.wikipedia.org/wiki/Galileo_Galilei |
| Jupiter | https://en.wikipedia.org/wiki/Jupiter |
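A rough sketch of how such a fine-tuned model might be queried, assuming the legacy `openai` Python client (pre-1.0) and a hypothetical fine-tuned model name; the prompt separator and stop sequence must match whatever was used in the training data:

```python
import openai  # legacy client, pre-1.0 interface

openai.api_key = "YOUR_API_KEY"  # placeholder

article = (
    "Galileo Galilei was a great scientist. He made great discoveries, "
    "one of them was about the planet Jupiter."
)

# "davinci:ft-your-org-2023-01-01" is a hypothetical fine-tuned model id.
response = openai.Completion.create(
    model="davinci:ft-your-org-2023-01-01",
    prompt=article + "\n\n###\n\n",
    max_tokens=256,
    temperature=0,
    stop=["END"],
)

# The completion comes back as pipe-separated "entity | URL" lines,
# matching the table format shown above.
for line in response["choices"][0]["text"].strip().splitlines():
    entity, url = [part.strip() for part in line.split("|", 1)]
    print(entity, url)
```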
Content can be made easier for computers and search engines to understand if we help them by structuring our data, enabling them to mechanically extract it and make sense of the content.
Search engines are smart and can pick up many such entities on their own. But that is still a probabilistic approach, and it costs much more than simple retrieval. Also, while this process can be fairly easy for very well-known entities like big countries or famous brands, it becomes more difficult for lesser-known ones. For example, "apple" is fairly easy to figure out, because it is a very famous brand. But what about "pineapple"? Is it a company, a brand, an app? We can make it easier for search engines to learn about such an entity by providing structured data about it.
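As an illustration of what that structured data might look like, here is a minimal sketch that builds a schema.org JSON-LD snippet for a hypothetical organization called "Pineapple"; the name, URL, and sameAs link are made-up placeholders.

```python
import json

# A hypothetical organization named "Pineapple"; all values are placeholders.
structured_data = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Pineapple",
    "url": "https://www.example.com",
    "sameAs": ["https://en.wikipedia.org/wiki/Example"],
}

# Embedding this in a <script type="application/ld+json"> tag on the page lets
# search engines read what "Pineapple" refers to instead of having to guess.
print(json.dumps(structured_data, indent=2))
```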