Haystack: The Search Tool You Didn't Know You Needed
On a daily basis, millions of terms are entered into the Wikipedia search engine. What comes back when people search for those terms is largely due to the work of the Discovery team, which aims to "make the wealth of knowledge and content in the Wikimedia projects easily discoverable."
The Discovery team is responsible for ensuring that visitors searching for terms in different languages wind up on the right results page, and for continually improving the ways in which search results are displayed.
Dan Garry leads the Backend Search team, which maintains and enhances search features and APIs and improves search result relevance for Wikimedia wikis. He and his team have a public dashboard where they can monitor and analyze the impact of their efforts. Yet they do much of their work without knowing who is searching for what—Wikipedia collects very little information about users, and doesn't connect search data to other data like page views or browsing habits.
Dan and I talked about how the search team improves search without knowing this information, and how different groups of people on Wikipedia use search differently. An edited version of our conversation is below.
———
Mel: You mentioned in an earlier conversation we had that power editors use Wikipedia's search in a completely different way than readers. What are some of the ways that power editors use search?
Dan: Power users use search as a workflow management tool. For example—they might see a typo that annoys them or a word in an article that is misused a lot, or be looking for old bits of code that need to be changed, and then search for that to see if corrections can be made. In that example, unlike your average user, they're actually hoping for zero results from their query, because it means the typo isn't present anywhere.
Another way that power users might use search is to look for their usernames, because they might want to find places where they've been mentioned in discussion—and they want to "sort pages by recency" so that they can see the most recent times they've been mentioned.
That represents a difference from someone who simply wants to find an article. Our power users aren't always trying to find an article—they're trying to find pages that meet certain criteria so that they can perform an action on those pages. They're interested in the whole results set, rather than one or two results.
———
Mel: It sounds like power editors don't always want or need relevancy. (Although I'm sure sometimes they do.)
Dan: That's correct. It's something we'd like to study more in-depth. We prioritize relevancy for readers, but editors and even some kinds of readers might need something completely different.
———
Mel: There are a lot of ways to search Wikipedia. Off the top of my head, I can think of searching through search engines, through wikipedia.org, through an individual article page, and on the mobile apps. Do you notice differences between all of these different pathways into the site?
Dan: Occasionally we do. I used to be a product manager for mobile and I was focusing a lot on search. I was interested in search as an entry point for the mobile app.
But we found that a lot of people were having trouble with things like finding the search tool. We had made an assumption that keeping a search query in the search bar would be useful for the end user, but people thought that was the title of the page, and they were really confused.
When we realized that this could be an issue, we did a lot of qualitative user studies with people, and asked staff who weren't on the product team what they thought. It was helpful to get perspectives on this feature on the app outside of the dev team, from actual users.
We decided to change the way that search appeared in the app once a page loaded. When people navigated to that page, we deleted their search phrase from the search box, which helped people know where to look to start searching again.
We've also thought quite a bit about images and their relationship to search. We thought about adding images in search results, and we found that adding images to the search results changed user behavior quite a bit. Instead of clicking on the first link, which may or may not have been the most relevant result, people would almost always prefer articles with pictures, even if the articles were further down the search results page. We asked why, and people said that they felt that the result was more comprehensive or complete.
It's funny how changing something small can immediately have a huge effect. When we made the picture change, we also saw a small drop in people clicking through to the articles. This alarmed us because we thought we were enhancing things for the end user, and we were worried that by adding the pictures we may have inadvertently caused them to not get the information they needed. But we did some digging, and found it was the opposite: for some queries, the answer to the search query was given in the search results, so they didn't need to go to the article. We were meeting their user needs earlier in the search process, which was fantastic.
You really need both quantitative and qualitative data to truly understand all the ways users use your product. Having only one or the other can paint an unclear picture.
———
Mel: What kinds of things do you think about when thinking about relevancy?
Dan: This is a tricky topic. The fundamental approach assumes that you can break down relevance into an equation that aggregates different factors, and then produces results that are "the most relevant." That's clearly not always going to be the case. If I search for 'Kennedy,' I could be looking for the airport, or the President, or I might be looking for John Jr. or Ted. There is no single correct "most relevant result" for that query.
There's a multitude of different factors—we used to use something called tf-idf to figure out what to surface in what order. tf-idf stands for 'term frequency–inverse document frequency', which combines measures of how much words are mentioned in one article with how much they're mentioned across the whole site.
So if I were to search for "Sochi Olympics": the word "Sochi" is relatively rare, but the word "Olympics" is much more common, so the algorithm knows that the "Sochi" part of the query is probably the more important one, and that's how it finds the 2014 Winter Olympics article as opposed to other articles about the Olympics.
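The weighting Dan describes can be sketched in a few lines. This is a minimal, illustrative tf-idf scorer over a toy corpus—the document texts below are hypothetical stand-ins, not Wikipedia's actual index or ranking code:

```python
import math

# Toy corpus standing in for Wikipedia articles (hypothetical text).
docs = {
    "2014 Winter Olympics": "sochi olympics winter olympics games",
    "Summer Olympics":      "olympics summer games olympics",
    "Sochi":                "sochi city russia",
    "Olympic Games":        "olympics games history",
}

def tf_idf(term, doc_words, all_docs):
    """Term frequency in this document times inverse document frequency."""
    tf = doc_words.count(term) / len(doc_words)
    df = sum(1 for text in all_docs.values() if term in text.split())
    idf = math.log(len(all_docs) / df)  # rare terms get a larger idf
    return tf * idf

def score(query, title):
    words = docs[title].split()
    return sum(tf_idf(term, words, docs) for term in query.split())

# "sochi" is rare in the corpus, so it dominates the query's score
# and pulls the 2014 Winter Olympics article to the top.
best = max(docs, key=lambda title: score("sochi olympics", title))
```

Here "sochi" appears in only two of the four documents while "olympics" appears in three, so the rarer term carries more weight—exactly the behavior described above.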
———
Mel: It sounds like that would be challenging for words that have multiple meanings.
Dan: That's true, and something we think about a lot. If you go to Wikidata and you search for life on the search page, you get search results like: Life Sciences, the Encyclopedia of Life, IUBMB Life, Cellular and Molecular Life Sciences, the phrase slice of life, the video game Half-Life… but you don't get the item on the concept of something being living.
And that's because of the term frequency and inverse document frequency. A lot of the pages I just mentioned have the term life in them. And, by coincidence, the item about life itself doesn't actually have the word life in it very often. Which means the actual result for life is far down, because it doesn't seem as important as the others, even though it is!
———
Mel: I imagine there must be ways to mitigate that.
Dan: We've switched to an algorithm called Okapi BM25 instead of tf-idf—it's a newer algorithm. (BM stands for Best Match.) Basically, what BM25 says is that there isn't a huge difference between a term being mentioned a million times and a term being mentioned ten thousand times. Using the new algorithm and switching to a more precise way of storing data about articles helped with the Kennedy problem a lot, because it pays less attention to how frequently the word Kennedy is repeated within any one page. Before, John Fitzgerald Kennedy was on the second page of results, and now he's about 7th or 8th in terms of results.
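The saturation Dan mentions is visible in BM25's term-frequency component. The sketch below is a simplified single-term BM25 score with the standard default parameters (k1 = 1.2, b = 0.75); the specific numbers are illustrative, not Wikipedia's configuration:

```python
def bm25_term_score(tf, doc_len, avg_doc_len, idf, k1=1.2, b=0.75):
    # BM25's saturating term-frequency component: the score grows with tf
    # but asymptotically approaches a ceiling of idf * (k1 + 1), so huge
    # repetition counts stop mattering.
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)

idf = 1.0
# Under plain tf-idf, 1,000,000 mentions would score 100x more
# than 10,000 mentions. Under BM25, the scores are nearly identical:
low  = bm25_term_score(tf=10_000,    doc_len=1000, avg_doc_len=1000, idf=idf)
high = bm25_term_score(tf=1_000_000, doc_len=1000, avg_doc_len=1000, idf=idf)
```

Both scores land just below the ceiling of idf × (k1 + 1) = 2.2, which is why repeating a word many more times barely moves a page's ranking.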
———
Mel: Does the site use BM25 everywhere?
Dan: We use BM25 on every Wikipedia that is not in Chinese, Thai, Japanese, and other languages where words in a sentence don't have spaces in between them. We tested BM25 and it caused a massive drop in the zero results rate on the spaceless languages due to a bug in the way words are broken up, or tokenized. We learned the algorithm wasn't working on those languages, so we deployed it everywhere else. We're hopeful that we can fix that problem for spaceless languages in the future.
———
Mel: What has been the most unexpected thing you've learned through search?
Dan: There is a surprisingly long tail when it comes to the frequency of searches.
One of the first things we were asked by our community members is "Why don't you make a list of the most popular queries that give zero search results so editors can make redirects or discover articles that need to be written?"
The data is not that useful, as it turns out. In our analysis of the problem, some of the most popular zero-result searches were "{searchTerms}" and "search_suggest_query", which we think are bugs in certain browsers or automated search systems.
We also found that a lot of people were searching for DOIs, which are digital object identifiers used by academic researchers. Most of the searches for those got zero results. We had to ask ourselves "What are people doing?" And we found there was a tool that let researchers put a DOI into it to see whether their paper was cited in Wikipedia. Of course, most papers that people are searching for aren't in Wikipedia, so it's actually correct to give them zero results!
When I started in search, we believed that users should never get zero results when searching. But it turns out that a lot of people were searching for things we don't have, and it's correct to give them zero results.
———
Mel: I know that Wikipedia has a very strict privacy policy and tracks hardly anything. What do we collect?
Dan: We do track some info. We have event logging that says 'This user with this IP clicked on the fourth result, it took us this long to give them results', and so on. But it's Wikimedia's policy to delete all personally identifying information after ninety days. That is a very intentional thing we decided on to protect user privacy.
If you don't want information about users to be revealed, the only thing you can do is to not record it. If we get subpoenas, we are legally required to comply with them. But if we don't have that information, we obviously can't give it out! So it's the safest way to keep users' privacy protected. We can figure out some things by language, but not geography.
But it's tricky sometimes. A good example of that within the Latin alphabet is the search term "paris". What language is that in? Is it English? French? If I search for "cologne", it's a city in Germany but also a perfume in English. And that's an example of relevance. Is a user who searches for "cologne" searching for a fragrance or a city? These things make delivering good search results really difficult, but we keep on trying, and keep making them a little better every day.
Melody Kramer, Senior Audience Development Manager, Communications
Dan Garry, Lead Product Manager, Discovery Product and Analysis
Wikimedia Foundation
Source: https://diff.wikimedia.org/2017/03/14/building-wikipedia-search/