Saturday, March 27, 2010

Taxonomy and Classification resources should sit on development teams

Little bursts of controversy seem to occasionally erupt in the blog-o-sphere that ignite the debate about the "worth" of taxonomy, and by extension, of taxonomists. But the debate tends to focus on defining a taxonomist's job. I think that we have defined the position pretty well, but now we need to be sure that it sits in the right place in the organization. In my experience, both as an employee and as a consultant, the right place for an information retrieval professional is on the development team. And developing information retrieval artifacts (like taxonomies and classification rules) deserves one (or many) full-time resources; I have seen many good IR projects fail because IR duties are assigned to someone as an "as-needed" part of their day job.

Here is a little "typical taxo case study" that I wrote a while ago to illustrate the point. Does this resonate? How would you staff the project?

Zathras Inc is a large consumer product information company. Privately held, it derives revenue from its Internet Web site: consumers purchase advertising on the site, as well as reports that its analysts generate. It also provides subject-based access (via subscriptions, RSS feeds, and the occasional tweet) to its topic areas of expertise. Zathras has been in existence since the Internet boom in the 90s, but has only recently begun to think about creating an inventory for its archive of reports, which exist in file directories on individual PCs, in Lotus Notes databases, and in a fledgling SharePoint document store. Zathras also has an Intranet site to support its basic employee functions: HR, Payroll, Marketing and Company News, but the Intranet site is managed by a small IT team that has historically reported up to a different business group than the IT team that manages the Internet site. Recently, their CTO, also a founder of the company, stepped down, and the new CTO merged the Internet and Intranet teams. She also suggested that the team look into purchasing an enterprise search engine. The new CTO has recently read about “taxonomy” in the press, and wonders if the research team, which currently consists of two professional librarians, might have time to work on a taxonomy project. She’d like to see Zathras become a leader in the Web 2.0 space, and has hired a project manager to help her meet this goal.

Luckily, her project manager, Peter “Bert” Ono, has been involved in this field before, having come from a major computer company, and he realizes that he needs to scope the problem a bit. He quickly forms two teams, drawing on the existing expertise in the company. First, he asks his CTO if she can head up a team to research available “Web 2.0” technologies. He asks if she can provide specific use cases about how these technologies might work on Zathras’ Internet site. He asks her to include members of Sales and Marketing on this team, because he’d like these use cases to be based on real business problems. He also asks her to include an analyst who has been with the firm almost since its inception. The CTO is a bit surprised at Bert’s request, but agrees to include the analyst when Bert explains that the analyst’s historical knowledge will help shape the way content appears on the site. Bert feels that the analyst also has key insights about the way consumers really use their site.

Bert’s second team will concentrate on taxonomy development. Both Zathras’ Web site and their Intranet site include categories that are used for navigation, but since no one is quite sure if that’s what the CTO means by “taxonomy,” Bert realizes that his team needs to provide a definition. Bert’s second team will include people from content development, IT, and the librarians from Research. The second team is charged with coming up with a taxonomic structure that will be used for more than just site navigation; Bert hopes that he will be able to use the taxonomy to replace the keywords currently used in the Lotus Notes databases, and, eventually, he’d like to see the taxonomy improve the user experience with the new search engine. He works with IT management to include the same member of the team who will help to define requirements for Zathras’ enterprise search engine purchase, but the two initiatives are not aligned yet.

Given this charter, the taxonomy team begins its work. Once appropriate access has been granted, the team members fan out to create a content inventory. Part of the team surveys content formats, and finds that content exists in Word documents, PowerPoint presentations and PDF files, as well as in two legacy Lotus Notes database applications. One has been used in the past for internal discussion of consumer products, and the other is still used to generate content for Zathras’ Web site. The team finds, not unexpectedly, that metadata use has been sporadic in all these sources, especially in the documents out on the file shares. Even the authorship information is unreliable in the Word documents, due to the analysts’ long-held practice of sharing document templates among themselves. The team does find that the most consistent metadata exists in the Lotus Notes database that is used for publication, but these metadata items will not be easy to maintain outside of Lotus Notes, and there has been no control of the “keywords” that authors assign to their documents before publication.

Interestingly, the opposite is true in the Lotus Notes database that is used for internal discussion. Because employees have been eager to share their expertise with their peers, this database has a rich set of keywords that are used to create Lotus Notes “views”, or subject-driven indexes to the database content. Employees have been further motivated by the knowledge that this internal forum is soon to be converted to a Wiki that will be accessible on the corporate Intranet site. Zathras’ corporate culture has traditionally rewarded its most innovative thinkers, no matter which department they are from, so employees are eager to be first among their peers.

While these team members prepare their findings, the research librarians begin to collect data about the types of questions the analysts typically ask. The librarians have always kept records of these queries and are easily able to analyze trends. The IT member of the team is also able to provide search logs from Zathras’ existing site, and the librarians work to combine these two sources of user terminology into a cogent list that they keep in an Excel spreadsheet. This list, along with the content type information collected earlier, will form the basis of the new taxonomy.

During this work, however, the CTO comes back with a new idea. She has recently read about new advances in text analytics tools, and has heard that these tools can also be used to create taxonomies. She’d like Bert’s team to take the lead in exploring these tools because she has been getting increasingly concerned about how manual the taxonomy-creation process has seemed so far. She has also heard good things about these tools from the team that has been talking to the enterprise search engine vendors and would like Bert’s team to help her substantiate their claims.

While Bert finds this interesting, he also has a dilemma. Researching the text analytics tools will put a serious strain on his existing resources, and could adversely impact his schedule. He goes back to his CTO and asks if he might merge his taxonomy development team with the IT team that is researching enterprise search. The CTO approves of his idea, but reminds him again of the project schedule, which has high visibility with Zathras’ “C-level” staff.

Given this new mandate, Bert again focuses on content, and asks several members of his team to identify a subset of the content to use to test the text analytics tools. He meets with the members of the search engine selection team to understand where they are in the vendor selection process, and is dismayed to hear that they have not made any provision for taxonomy integration in their vendor RFP. When pressed, the team lead admits that he himself isn’t sure what Bert means when he talks about “taxonomy” and that he would like to see a sample or mockup. Bert goes back to his reference librarians with this request and is surprised when they also resist. After a sleepless night, Bert realizes that the team needs a clear definition of how the taxonomy will really be used with the new search engine. He wonders which problem this team is really trying to solve with this new taxonomy.

Although she is not a member of his team, Bert decides to consult with Zathras’ User Interface designer to see if she can help him paint a clearer picture of what both teams are trying to achieve. She feels that the current search engine produces too many results for users; her goal is to get consumers directly to the reports they need sooner. She has looked at other consumer web sites and likes the faceted search approach, but feels that Zathras’ users look for reports in three ways: by subject (for example, “Toys” or “Refrigerators”), by specific product name (for example, “Barbie”), or by company (for example, “Maytag”). She has also been given the mandate to include “Web 2.0” features in her design and would like to provide a mechanism for consumers to tag reports. She’s not sure how this will work yet, but she includes a “tagging widget”, a small component that only supports end-user tags, in her design. She shares her preliminary wireframes, which are mockups of the components of her design, with Bert.

Given this information, Bert feels that he finally has clear direction for his team, and he convenes a meeting to assign tasks. He asks the librarians to create a top-down structure for the taxonomy, based on high-level subject area, and on company and product names. He also asks that they begin to look for sources of company and product information so that they can add definitions to their terms as they compile them. This is a relief for the librarians, because they are expert in identifying sources of information and in using external thesauri to find terminology to help shape research queries. Since the taxonomy is going to support consumer search, the librarians decide to include synonyms in their taxonomy design. They realize that consumers might need information on which companies make consumer products, so they decide to model a relationship type that links products with their parent companies. Bert commends them for this decision because he sees that it is a first step towards a Zathras “ontology,” which will eventually include additional knowledge about companies and products. He asks them to begin to write a “rules of the road” document to record these taxonomy design decisions.
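To make the librarians' design decisions concrete, here is a minimal sketch of how such a taxonomy might be modeled. Everything in it is an assumption for illustration: the `Term` and `Taxonomy` classes, the field names, and the sample synonym are hypothetical and do not describe what the Zathras team actually built. It shows three facets (subject, product, company), synonyms that resolve to a preferred term, and a product-to-company relationship.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Term:
    """One taxonomy entry. Field names are illustrative assumptions."""
    label: str                                   # preferred term
    facet: str                                   # "subject", "product", or "company"
    synonyms: set = field(default_factory=set)   # alternate names consumers might type
    definition: str = ""
    made_by: Optional[str] = None                # product -> parent company label


class Taxonomy:
    def __init__(self) -> None:
        self.terms = {}      # preferred label -> Term
        self._lookup = {}    # lowercased label or synonym -> preferred label

    def add(self, term: Term) -> None:
        self.terms[term.label] = term
        for name in {term.label, *term.synonyms}:
            self._lookup[name.lower()] = term.label

    def resolve(self, text: str) -> Optional[Term]:
        """Map a label or synonym (case-insensitive) to its preferred term."""
        label = self._lookup.get(text.strip().lower())
        return self.terms[label] if label else None

    def products_of(self, company: str) -> list:
        """Follow the product -> company relationship in reverse."""
        return sorted(t.label for t in self.terms.values() if t.made_by == company)


tax = Taxonomy()
tax.add(Term("Toys", "subject"))
tax.add(Term("Mattel", "company"))
tax.add(Term("Barbie", "product", synonyms={"Barbie doll"}, made_by="Mattel"))

found = tax.resolve("barbie doll")   # synonym lookup returns the preferred term
mattel_products = tax.products_of("Mattel")
```

A structure along these lines is also the kind of thing a "rules of the road" document would pin down: which facets exist, what counts as a synonym, and which relationship types are allowed.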

Bert then asks his content creation and technical team members to work together on evaluating available text analytics tools. He asks members of the search engine team to press the search technology vendors for an understanding of which of these tools (if any) they support, to help shape this analysis. He also asks this team to compile a list of product names to use to drive this test; he knows that the C-level execs will also be interested in understanding how often specific product names are referenced within the reports that Zathras produces.

Though it isn’t in place yet, Bert realizes that he will need a mechanism for the content analysis team to feed their results back to the librarians; he realizes that the text analytics results will be a rich source of the synonym information that the team needs. He also realizes that the entire team will need to become proficient in more than one software tool. Armed with this information, he writes up a job requisition for a “Technical Architect.” He expects that this person will pull together a high-level view of how Zathras’ existing content will feed both the new taxonomy and, eventually, the search engine.

All of Bert’s team members are busy compiling their results in the 6 weeks it takes to bring this new technical architect on board. The librarians have come up with a basic, high-level subject taxonomy that includes about 250 terms, and they are working to define synonyms. All of their information is stored in an Excel spreadsheet that they mail back and forth to each other. They have also contracted with an acquisition resource and are evaluating various external sources for company and product information. They primarily look at scope, timeliness and delivery format. Bert is beginning to hear them grumble a bit about finding terms and inserting synonyms into the spreadsheet, but their method is working so far.

The content analysis team has also come back with their findings, but theirs are more mixed. Given the list of product names, the text analytics software has been able to identify some of the terms, but has also come back with partial names and other unexpected results. The team realizes that they will need someone to review these results on an ongoing basis in order to make them useful. The team has also experimented with using the software products to extract noun phrases, and has passed the resulting lists of phrases to the librarians to include in the subject portion of the taxonomy, but this activity is also taking longer than anyone expected. Bert feels that he has a good start on the project and reports back to his CTO.
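The partial-name problem is easy to reproduce. The sketch below is not the text analytics software the team evaluated; it is a plain dictionary matcher, written under the assumption that product names are matched as whole words and case-insensitively. The function name and the product names ("Acme Blender", "Acme Blender Pro") are made up for illustration. It counts how often each known product is mentioned, and tries longer names first so that a longer product name is not reported as a partial match on a shorter one.

```python
import re
from collections import Counter


def count_product_mentions(text, product_names):
    """Count whole-word, case-insensitive mentions of each known product name."""
    if not product_names:
        return Counter()
    # Longest names first, so "Acme Blender Pro" wins over "Acme Blender".
    ordered = sorted(product_names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in ordered) + r")\b",
        re.IGNORECASE,
    )
    # Map any casing found in the text back to the name as listed.
    canonical = {name.lower(): name for name in product_names}
    return Counter(canonical[m.group(1).lower()] for m in pattern.finditer(text))


# Hypothetical report text and product list.
report = (
    "The Acme Blender Pro outsold the Acme Blender. "
    "Reviewers liked the acme blender pro."
)
mentions = count_product_mentions(report, ["Acme Blender", "Acme Blender Pro"])
```

Even a careful matcher like this will miss misspellings and abbreviations, which is one reason the results still need a human reviewer.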

But Bert is disappointed when she says that she feels she doesn’t have enough information to “put it all together.” What she is missing, she says, is the piece that links the terms in the taxonomy, which look fine on the spreadsheet, with the actual reports. She needs to see how well these terms will work once they are associated with documents.

When Bert relays this to his team, he is again gratified with their positive response: they have also been looking for feedback. Zathras doesn’t have a dedicated usability expert in-house, so he again contacts his friend in User Interface design, who suggests that he try a “closed card sort,” a technique that has users match a small set of documents with pre-defined categories. The librarians agree to try this approach with members of the Sales and Marketing team.

But Bert is still worried about cost, and he does not believe that he has the resources required for manually matching terms with documents in the long run. He asks his new technical architect to come back with some suggestions for a more automated approach.

Josh Gordon, Bert’s new technical architect, is very familiar with search, having just come from a company that used one search engine successfully for years on its Intranet. Josh suggests that this might be another place where the new search engine vendor can help out. Josh volunteers to work with key members of Bert’s team to develop a categorization strategy for Zathras. Bert reminds Josh that he is constrained by budget and by schedule; the CTO has committed to a “Web 2.0” consumer experience by year-end.

Josh realizes that Zathras’ primary function is research and information-gathering, and that Bert will be hard-pressed to hire additional resources to support categorization work. Josh knows that rule-based categorizers are the most accurate, but he worries about staffing requirements, both to write the rules, and to test them against Zathras’ content. He eventually recommends a hybrid approach, combining machine-learning techniques with rule-based techniques, and suggests that the team look to the search engine vendors to help provide a solution in that area.
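For readers who want a feel for what a hybrid categorizer looks like, here is a rough sketch under stated assumptions. The case study does not say which machine-learning technique Josh had in mind, so a tiny Naive Bayes classifier stands in for it here; the rules, training snippets, and function names are all hypothetical. Hand-written rules are tried first, and the learned model only decides when no rule fires.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one smoothing (a stand-in learner)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word frequencies
        self.doc_counts = Counter()              # category -> number of training docs
        self.vocab = set()

    def train(self, examples):
        for text, category in examples:
            tokens = tokenize(text)
            self.word_counts[category].update(tokens)
            self.doc_counts[category] += 1
            self.vocab.update(tokens)

    def predict(self, text):
        """Return the most probable category, or None if the model is untrained."""
        total_docs = sum(self.doc_counts.values())
        best, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for tok in tokenize(text):
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best, best_score = category, score
        return best


def categorize(text, rules, model):
    """Apply hand-written rules first; fall back to the learned model."""
    for pattern, category in rules:
        if re.search(pattern, text, re.IGNORECASE):
            return category, "rule"
    return model.predict(text), "model"


# Hypothetical rules and training data.
rules = [(r"\brefrigerators?\b", "Refrigerators")]
model = NaiveBayes()
model.train([
    ("doll playset for children", "Toys"),
    ("action figure and doll accessories", "Toys"),
    ("freezer compartment and ice maker", "Refrigerators"),
    ("energy use of the freezer and crisper drawer", "Refrigerators"),
])

by_rule = categorize("Best refrigerators of the year", rules, model)
by_model = categorize("A new doll with accessories", rules, model)
```

Returning the source of each decision ("rule" or "model") is a deliberate choice in this sketch: it lets a reviewer see which assignments came from curated rules and which deserve a second look.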

Armed with Josh’s technical architecture, the librarians’ taxonomy design, the user interface wireframes, and the categorization recommendations, Bert goes back to his CTO and reports his status. She compliments him and his team on their good work, and gives him the green light to proceed with the search engine purchase and eventual taxonomy implementation.
