Webhose provides articles scraped from news sites, and the parsed text of these pages is used for Watson NLU enrichments.
Some URLs contain additional articles that are extraneous to the main article, and the parser includes this extraneous text as part of the main article text. The contaminated text is then enriched within Watson Discovery, so the NLU results include entities from the extraneous text. As a result, articles are tagged as highly relevant to an entity that is not related in any way to the main article at the URL, which produces a false positive match for a query. In addition, these extraneous articles can change over time, so the extraneous articles present at the time of scraping may no longer be present on later visits to the main article URL.
Here is a specific example: query for articles with IBM and Zillow as entities and "patent" as the keyword. The returned articles include some that are not relevant to these entities.
URL of one article returned by the query above: https://www.law.com/2020/08/31/how-a-trial-lawyer-survived-a-14-hour-zoom-hearing/?slreturn=20200810142110
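For reference, a query like the one described above might be expressed roughly as follows. This is only a sketch: the exact query I ran is not reproduced here, the helper `build_query` is hypothetical, and the field names `enriched_text.entities.text` and `enriched_text.keywords.text` are assumed from the usual Discovery enrichment schema. In the Discovery Query Language, `:` means "includes" and `,` means AND.

```python
def build_query(entities, keyword):
    """Build a Discovery Query Language string requiring every entity AND the keyword.

    Hypothetical helper; the field names below are assumptions, not taken
    from the original report.
    """
    # One "includes" clause per entity name.
    clauses = [f'enriched_text.entities.text:"{name}"' for name in entities]
    # One "includes" clause for the keyword.
    clauses.append(f'enriched_text.keywords.text:"{keyword}"')
    # A comma joins clauses with AND semantics.
    return ",".join(clauses)


if __name__ == "__main__":
    print(build_query(["IBM", "Zillow"], "patent"))
```

The resulting string would be passed as the query parameter of a Discovery query request. An article such as the one linked above can still match it if the scraped text included unrelated sidebar articles that mention these entities.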
|Who would benefit from this IDEA?||All users of WDN, as false positives will be reduced|