Wikipedia:Category intersection

The blue category A, the pink category B and the violet intersection called A ∩ B

Category intersection is the ability to find all articles that are members of more than one category. It requires a change to the MediaWiki software as well as a major change to the policies related to how categories are populated. It is hoped that these changes will solve some long-standing categorization problems and end some common conflicts between Wikipedia editors. Category intersection also offers the possibility of adding several new features that will benefit users by adding valuable research and indexing tools as well as making the category system easier to manage.

Many existing categories are logically the intersection of attributes for which "primary" categories exist; for example, Category:American actors is logically the intersection of Category:Actors and Category:American people. Although these "primary" categories are today generally subdivided into subcategories, if they were directly (fully) populated the "intersection categories" could be automatically generated. Categories in the German Wikipedia are already organized into fully populated primary categories.

This proposal tries to envision the changes necessary to make category intersection a reality. It is designed to augment the current categorization system, not replace it.

Background

When categories were initially added to Wikipedia in 2004 there was no mechanism to limit the search result for large categories. Very large categories caused performance problems, and a software change was made to limit the search result to 200 entries at a time. If there are more than 200 entries, users must navigate through multiple pages in order to see all the entries. This page by page navigation mechanism becomes impractical with large categories, as it takes much too long to see the entries at the end of the alphabet. The performance considerations of large categories and page-by-page navigation precipitated policies to depopulate large categories into smaller subcategories.

In mid 2005 the category table of contents template, {{CategoryTOC}}, was created. With the table of contents it became possible to navigate through very large categories with a few clicks. Due to the combination of the performance change and CategoryTOC, there is no longer any reason that categories need to be small.

Multiple category taxonomies have been part of the categorization scheme from the beginning. It is possible to take a category and subcategorize it in many different ways. Use of these "subset" categories makes it difficult to find all members of a "higher level" category; either articles have to be added to both the "subset" and "higher level" categories or the members of the "subcategories" (and, recursively, their subcategories) have to be enumerated. Precisely defining the circumstances in which articles should be added to both "lower level" and "higher level" categories, and even whether this is ever appropriate, remains a source of continuing discussion among editors (see, for example, Wikipedia:Categorization/Categories and subcategories and Wikipedia talk:Categorization/Archive 7).

This history has led to several overlapping views about the purpose of Wikipedia's categories and to the creation of several distinct kinds of categories:

  • Categories are a tool for browsing: they function as a table of contents, leading users to the articles on a specific subject. An example category of this type is Category:Film actors.
  • Categories are a means of classifying articles: the current conventions encourage placing articles in the most specific category. Having categorization function as a classification system is often in conflict with categorization as a tool for browsing. For example, suspension bridges are added to Category:Suspension bridges, but not Category:Bridges. This makes bridges hard to find by browsing unless the user already knows the type of bridge (or is only interested in certain types of bridges).
  • Categories are an index of a subject: Due to the current conventions for categorization, many topic level categories are not usable as an index because they have been broken into subcategories and depopulated. For example, there is no way to see an index of all American people. It would be useful to have categories fully populated at the "level of notability". For example, directors are much more likely to be notable as "film directors" than as "American film directors".
  • Categories are a database search: Many categories are in essence the intersection of two or more larger categories. For example, Category:American film directors can be thought of as the intersection of Category:Film directors and Category:American people. There are many intersection categories that do not exist that some people might find useful. Adding more and more of these categories clutters up the category listings for articles so they are discouraged and often deleted. In addition, since these categories are manually populated it is entirely likely that an article in both Category:Film directors and Category:American people does not appear in Category:American film directors or, conversely, that an article in Category:American film directors does not appear in one or both of Category:Film directors or Category:American people.
  • Categories are an index of other categories: There are many categories that function simply as an index of other categories. For example, nearly all the subcategories of Category:People by nationality and Category:Categories by country are index categories providing an index of a specific set of "X by Y" intersection categories.

Category intersection has been a desired feature for quite some time. The wikitech-l mailing list archives show that someone even wrote code implementing a version of category intersection. One comment there points out its limitation: "I don't see how this can be more than marginally useful unless it also searches all subcategories to infinite depth (with recursion checks?!)."
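The limitation raised in that comment can be illustrated with a minimal sketch. The category graph below is hypothetical toy data, not MediaWiki's schema, and the function name all_members is an assumption made for illustration. It collects the articles of a category and all of its descendants, using a set of visited categories as the "recursion check" so that a cycle in the category graph cannot cause an endless loop.

```python
from collections import deque

# Hypothetical toy data: category -> direct subcategories / direct articles.
SUBCATEGORIES = {
    "Actors": ["Film actors", "Stage actors"],
    "Film actors": ["Actors"],  # deliberate cycle to exercise the check
    "Stage actors": [],
}
ARTICLES = {
    "Actors": {"Acting"},
    "Film actors": {"Example film actor"},
    "Stage actors": {"Example stage actor"},
}


def all_members(category):
    """Collect articles in a category and in all of its descendants."""
    seen = {category}          # categories already queued (the recursion check)
    queue = deque([category])
    members = set()
    while queue:
        current = queue.popleft()
        members |= ARTICLES.get(current, set())
        for sub in SUBCATEGORIES.get(current, []):
            if sub not in seen:
                seen.add(sub)
                queue.append(sub)
    return members


print(sorted(all_members("Actors")))
```

An intersection that looks inside subcategories would first expand each category this way and then intersect the resulting sets, which is considerably more work than intersecting direct members only.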

Using MediaWiki search to find category intersections

It is possible to use the search parameter incategory: to find category intersections; however, this facility does not look inside subcategories. To find a category intersection, type incategory:"CategoryName" in the search box for each category of interest. For example, incategory:"German films" incategory:"1998 films" will return the articles that are common to both categories, that is, German films released in 1998. Similar results can also be found using the Wikidata Query Service.

The core proposal

  • Fully populate most of the primary (topic level) categories that have been broken into subcategories.
  • Add the capability to create category intersections on the fly. All users will be able to select categories and create an intersection from the selection.
  • Show only primary (topic level) categories on the bottom of article pages.
  • Create a simple interface to select category intersections from any article page.

Fully populated primary categories

For category intersection to work best, many categories must be fully populated. A category should either contain ALL the articles that meet its definition, or contain NONE of them because they can all be found in its subcategories. A fully populated category would be called a "Primary" category. Primary categories should correspond to topic articles; that is, there is, or could be, an eponymous article for the category. An example of this is Film director.

The general rule would be: If a category can be completely and totally expressed as the intersection of other categories, it is not a primary category and should be defined only as this intersection. For example, Category:American film directors can be defined as the intersection of Category:Film directors and Category:American people, which would in turn be fully populated primary categories. Category:American film directors would not exist as a "regular" category, and would never appear as a category in any article. Articles in both categories would be displayed by selecting to view their intersection. If there are articles that relate to an intersection topic, but for some reason are not in one or more of the intersected categories, they can appear as normal wikilinks in a "See also" section in the intersection category's text description. For example there might be a comment to see an article called American film directors in the intersection corresponding to Category:American film directors.

All existing categories that are intersections would be depopulated and their members moved to the larger primary categories. Some primary categories will be rather large (like Category:American people). Since they are fully populated, each primary category will be a complete index of all the articles in Wikipedia that relate to the topic.
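As a minimal sketch of why full population matters, the following treats each primary category as a set of article titles and computes an intersection on demand. The data and the function name intersect are hypothetical and chosen only for illustration; the actual software would query the database rather than in-memory sets.

```python
# Hypothetical membership of two fully populated primary categories.
PRIMARY = {
    "Film directors": {"Director A", "Director B", "Director C"},
    "American people": {"Director A", "Director C", "Senator D"},
}


def intersect(*names):
    """Return the articles that are members of every named category."""
    sets = [PRIMARY[name] for name in names]
    return set.intersection(*sets) if sets else set()


# What Category:American film directors would contain if generated automatically.
print(sorted(intersect("Film directors", "American people")))
```

If "Director C" had been moved out of "American people" into a subcategory, the intersection would silently miss that article, which is why the proposal asks for primary categories to be complete.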

This proposal will change the list of categories that appears on articles. Only the primary (fully populated) categories will appear. For example, the Laurence Fishburne article currently contains the following categories:

Categories: 1961 births | African-American actors | American child actors | American film actors | American soap opera actors | American television actors | Living people | People from Augusta, Georgia | Tony Award winners

Under this proposal it would contain:

Categories: 1961 births | American people | People of African descent | Actors | Child actors | Film actors | Television actors | Living people | People from Georgia (US State) | People from Augusta, Georgia | Tony Award winners

There are a few things to note about this. The definition of some of these categories might be confusing. The "People from" categories are currently defined as people who have a notable connection with the place, but might not be citizens of the larger country. This means that both the smaller and larger subdivisions are primary categories. For the sake of facilitating intersection categories it would probably be useful to fully populate all geographical subdivisions from the level of nationality on down. Likewise, it is not possible to define film actors as the intersection of "film" (or film people) and "actors", because (for example) a person could be a famous stage actor who later became a film director. Articles might belong in these two categories but NOT belong in the "intersection" category, which means "film actors" is not a candidate for an intersection category. For this reason it might be decided to make "actor" and all the "actor by medium" categories primary categories.

New namespace for category intersections

There will be a new namespace for the creation of category intersections. Pages in this namespace, perhaps called "Index" or "Intersection", would look very similar to a Category listing of articles. In this proposal both names are used, but any other name could be selected when this proposal is implemented. Intersection pages can be created on the fly, simply by typing the name of the intersection you are looking for. For example, you could go to the page Intersection:Actor::American people::People of African descent. The same form would be the mark-up for creating a link to an intersection page. (Note: The precise mark-up and URL might look different from this.) So you could add a link to an intersection page by adding:

[[Intersection:Actor::American people::People of African descent]]

to a page. Like any other link, these links could be "piped" so the text displayed to the user would not have to be the "raw" link. The link would display as a "blue" link (page exists) not based on whether there is an existing page in the intersection namespace but based on whether all the categories being intersected exist in the category namespace. This means any intersection of existing categories would appear to exist, whether a user has previously "created" the intersection page or not.
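The "blue link" rule described above could be sketched roughly as follows. The "Intersection:" prefix and "::" separator follow the illustrative mark-up in this proposal, which the proposal itself says might change; the function names and the set of existing categories are assumptions made for this example.

```python
PREFIX = "Intersection:"   # namespace name used in this proposal's examples
SEPARATOR = "::"           # separator used in this proposal's examples


def parse_intersection(title):
    """Split an intersection title into the category names it intersects."""
    if not title.startswith(PREFIX):
        raise ValueError(f"not an intersection title: {title!r}")
    parts = title[len(PREFIX):].split(SEPARATOR)
    return [part.strip() for part in parts if part.strip()]


def link_is_blue(title, existing_categories):
    """A link is blue when every intersected category exists."""
    names = parse_intersection(title)
    return bool(names) and all(name in existing_categories for name in names)


# Hypothetical set of categories that exist in the category namespace.
existing = {"Actor", "American people", "People of African descent"}
title = "Intersection:Actor::American people::People of African descent"
print(parse_intersection(title))
print(link_is_blue(title, existing))
```

Note that the check never consults the intersection namespace itself, so no page needs to have been created there for the link to appear blue.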

Intersection pages will look more or less like category pages. The title of the page would be displayed, possibly followed by manually generated content (added by clicking "edit"), then the first 200 automatically generated links to the subcategories and articles that are members of all the intersected categories (much like a regular Category listing), perhaps followed by a mechanism to expand or further limit the intersection.

The page title will list the categories being intersected in the order specified in the URL used to access the page. Because Category A intersected with Category B is the same as Category B intersected with Category A, intersection pages have a number of built-in synonyms. More about this later.

User created category intersection

There will be several ways for users to create category intersections:

  1. By typing the URL of the intersection.
  2. By typing the name of the intersection in the "Search box" and clicking on "Go".
  3. By creating a link to the intersection on a page and then clicking on the link. (This will be useful for discussions and for creating lists of intersection pages.)
  4. By selecting categories listed at the bottom of article pages.
  5. (In some variants) By selecting other categories to intersect from another intersection display.

The fourth (and fifth) option would be a new and powerful feature. Using the same Laurence Fishburne article as an example, instead of the existing category listing the categories might be displayed like this:

Categories: 1961 births | Living people | American people | People from Georgia (United States) | People from Augusta, Georgia | People of African descent | Actors | Film actors | Television actors | Tony Award winners
[Show articles in all selected categories]
The exact wording of the link might be different, e.g. "Create index using all selected categories". There might also be a link that says "What is this?"

This arrangement is very similar to how tags work at Flickr.com, Delicious.com and IMDb's Movie Keywords Analyzer. The existing category listing would have a check box added beside each category. Any user would be able to view the result of a category intersection by checking the boxes next to the categories and then clicking on the link to view the intersection set. In this case, checking the three boxes for American people, People of African descent and Actors would lead to an intersection listing that is functionally very similar to the current Category:African-American actors, but dynamically generated based on an intersection of the selected categories rather than manually populated. Many existing categories could be replaced with intersections, and with this system any intersection is possible, including ones that have been previously discouraged and/or deleted via WP:CFD.

This adds a small amount of category "clutter", but adds the possibility of generating the intersection of any two or more categories. There may be a few more "primary" categories than now exist, but overall there may ultimately be fewer categories listed per article. ALL of the categories appearing at the bottom of an article would be fully populated primary categories and so would be useful as the components of intersections. You would be able to see an intersection even if nobody had explicitly created an intersection page for it, for example what would now have to be Category:African-American film actors from Augusta, Georgia who won a Tony Award. This creates the effect of having scores of categories without cluttering up articles.

Searches in the Intersection namespace will be done by first sorting the intersected categories into alphabetical order before doing a database query to find an existing intersection. This way any permutation of category order in a URL or link will match the appropriate intersection page.
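The lookup described above might be sketched like this. The "::" separator follows the illustrative mark-up used in this proposal, and the in-memory dictionary stands in for the database; both, along with the function names, are assumptions for the purpose of the example.

```python
def canonical_key(categories):
    """Sort category names so that every ordering maps to the same key."""
    return "::".join(sorted(categories))


# Hypothetical store of intersection pages, keyed by canonical name.
PAGES = {}


def save_page(categories, text):
    PAGES[canonical_key(categories)] = text


def find_page(categories):
    """Return the stored page text, or None if no page was created."""
    return PAGES.get(canonical_key(categories))


save_page(["People of African descent", "Actor", "American people"], "Description text")
# A different ordering of the same categories finds the same page.
print(find_page(["American people", "People of African descent", "Actor"]))
```

Plain string sorting is case-sensitive, so a real implementation would also need to normalize titles the same way the category namespace does before sorting.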

Options and variations

There are several variants of this proposal. The aspects that vary relate to:

  • How intersection pages are displayed
  • The interface for navigating around intersection space
  • How intersection space relates to category space
  • The conversion of current categories into intersections.

For each option, mockups and a subpage with further details are provided.

Option: Transclude intersections into categories

This option closely links intersections with categories. Categories that can be defined as intersections would be depopulated but would still remain in the category structure. Instead of adding articles into the category, the intersection page would be associated with a category page by giving it the category name. Once named, the intersection page would be bound to and automatically transcluded into the associated category. The current categorization structure would not be affected with this option. All currently existing categories would remain, with some being "regular" categories and some being redefined as "intersection" categories.

The basic features of this option:

  • Categories get re-defined as the intersection of fully populated primary categories when appropriate.
  • Articles can be automatically recategorized from intersection categories into the corresponding primary categories. This can happen when a category is first associated with an intersection and then later if any articles are added to the intersection category.
  • The categorization system is protected from vandalism by restricting some maintenance and editing to administrators.

Mockups:

Further details about this option: Wikipedia:Category intersection/Transclude intersections into categories

Option: Named indexes, separate from categories

This option is modeled after the look of an index that might be found in a book. In this option, the intersection space uses the name "Index" and its pages contain indexes of articles as well as links to more index pages. Like the option above, the "index" pages can be given names. Unlike the option above, the "index" pages are not associated with or transcluded into categories. Categories that can be defined as intersections will be deleted after recategorizing articles into appropriate primary categories. The deleted categories are replaced by indexes which can be categorized or manually linked to category pages. The "index" pages have sets of links to other indexes which are automatically generated by using the subcategories of the intersected categories. This allows users to easily traverse from one index to other related indexes. Pages in the index namespace could be edited (much like categories can be edited), allowing users to annotate the index page with descriptive text, add index pages to categories and add links to other related indexes.

The basic features of this option:

  • "Primary" categories are fully populated (using bots) by recategorizing articles from "intersection categories" into the corresponding primary categories.
  • Categories that can be defined as the intersection of fully populated primary categories will then be deleted.
  • Index pages show the articles that result from the intersection of the primary categories as well as the sub-indexes that are the intersections using subcategories.
  • Index pages will replace "index categories" allowing traversal to numerous "sub-intersections" from the intersection selection table shown on the index intersection page.
  • Some traversal in "intersection space" does not rely on a user created hierarchy.
  • Intersections are given easily understood names, for example "Index:African-American actors" instead of "Index:Actor::American people::People of African descent".

Mockups:

Further details: Wikipedia:Category intersection/Named indexes, separate from categories

Option: Separate intersection space

In this option the "intersection" namespace would be completely separate from the "category" namespace. Categories that could be defined as intersections would be deleted after recategorizing articles into appropriate primary categories. Every page in the intersection namespace would include an automatically generated intersection selection table, allowing users to easily traverse from one intersection to other related intersections. Pages in the intersection namespace could be edited (much like categories can be edited), allowing users to annotate the intersection page with descriptive text and to add intersection pages to categories.

The basic features of this option:

  • Fully populate (using bots) "primary" categories by recategorizing articles from "intersection categories" into the corresponding primary categories.
  • Delete categories that can be defined as the intersection of fully populated primary categories.
  • Replace "index categories" with index intersections, allowing traversal to numerous "sub-intersections" from the intersection selection table shown on the index intersection page.
  • Traversal in "intersection space" does not rely on pre-created links or categorization of intersections.
  • Intersections have only a functional name including the names of the intersected categories, with no "user friendly" name. This eliminates the need to establish guidelines for these names, or in any way control or manage them.

Mockups:

Further details: Wikipedia:Category intersection/Separate intersection space

Other variations

Other variations are possible. It is possible to combine, exchange and remove features from the three options above to create other options. We invite participants in this discussion to add any ideas they may have.

Changes to categorization policy

This proposal, if any of the options is implemented, will have a major effect on categorization policy. Some of these changes can be foreseen, and some will evolve as everyone gets used to the new system. Considerable thought and planning are also needed before implementing the change.

Once the new system is in place categorization policy will need to be revised. Many aspects of the new system will likely be controversial and it is likely that there will be lively discussion. It is also possible that there will be less controversy than with the current system.

Primary categories

The main change to policy will be the concept of a Primary category as described above. Primary categories should be tagged as such, so editors will know to fully populate them. Some categories may need to be split because they are both primary categories and navigational categories. A navigational category is a category which contains subcategories. An example of this is Category:American people by occupation. Navigational categories should not contain any articles. Currently, Category:American people functions as both a primary category and a navigational category. It probably should be split into Category:American people (which would be fully populated with articles about Americans) and Category:American people by type or something similar (which would have all or most of the subcategories). Category:American people by type would then be a subcategory of Category:American people. This will make it easier to navigate through the subcategories, especially when primary categories are very large and have many subcategories.

Categories as a table of contents: Browsing

The current guidelines say that categories are primarily meant as a method to browse through articles on a topic. This guideline does not need to change.

Categories as an index: Primary categories

Since primary categories will be fully populated, they will also function as a complete index of their topic. This feature will no longer be at odds with other functions of categories. The intersection pages will add additional indexing capabilities.

Categories as classification

Instead of classifying articles by finding the most specific subcategory for the article, they will be classified by finding all the primary categories they belong in. Their classification is in essence the intersection of all their categories. In most cases there will not be any other articles with the same set of primary categories.

Multiple taxonomies

Many subcategories have been discouraged or deleted because they were not considered important sub-classifications of existing categories. This would no longer be a problem with the new system. Adding attributes to people like gender (Category:Men or Category:Women) or religion (Category:Methodists) should no longer be controversial because sub-categories using these attributes will only be seen if people are looking for them. This will allow multiple taxonomies to coexist.

Currently, certain taxonomies are preferred, such as subcategories by nationality and occupation. This will no longer be the case. No taxonomy will appear to be better than any other. Certainly, taxonomies could still be removed if they are shown to be unencyclopedic. Deleting such a taxonomy will only require deleting a single primary category. Once it is deleted, the intersection pages that use it will no longer show any articles, and any links to intersections using the deleted primary category will be red. In option one (transcluding intersections into categories), any category whose intersection has a red link should be a candidate for speedy deletion. In all the options, any intersection page that has a red link to a primary category should also be a candidate for speedy deletion. This process can probably be automated with a bot.

Categories as a database search

This system, like the system at flickr.com, makes it easy to find articles that are similar in desired ways. From one actor from Ohio, a user will be able to find all actors from Ohio. From one English poet born in 1883, a user will be able to find all English poets born in 1883. From one suspension bridge in New York City, a user will be able to find all suspension bridges in New York City. This is currently not possible for most searches.

Future related upgrades

Category viewed as an outline

Currently there is a clear consensus not to put people in Category:Entertainers and instead to put them into its subcategories. It might be useful on occasion to see a complete index of what is in Category:Entertainers, including all the contents of subcategories. A future upgrade might add the ability to view any category as an outline. Perhaps there would be a link at the top of each category that says "View as an Outline". When the link is clicked, the category view would switch to an outline view. All the subcategories and articles would appear as a single alphabetical list. The subcategories would be formatted differently from the articles (perhaps in bold or a larger font). There would also be another option that says "Show contents of all subcategories". Clicking on this would add the contents of the subcategories to the category or list. If both options are selected, the subcategory contents would be indented and listed directly under the subcategory heading. The index view would only go a set number of levels deep and would not show the contents of any categories that are defined as intersections. Perhaps the depth of the index could be a user preference. There might also be a way to "flatten" the outline so that the contents of all the subcategories were combined into a single alphabetical list.
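A rough sketch of such an outline view, under assumed toy data, might look like the following. The category names below Entertainers, the square-bracket marker for subcategory headings and the max_depth parameter are all hypothetical choices for illustration; the depth limit also keeps the recursion finite if the category graph contains a cycle.

```python
# Hypothetical toy data: category -> direct subcategories / direct articles.
SUBCATEGORIES = {
    "Entertainers": ["Actors", "Musicians"],
    "Actors": ["Film actors"],
    "Musicians": [],
    "Film actors": [],
}
ARTICLES = {
    "Entertainers": [],
    "Actors": ["Actor A"],
    "Musicians": ["Musician B"],
    "Film actors": ["Film actor C"],
}


def outline(category, max_depth, depth=0):
    """Return indented lines; subcategory headings are shown in brackets."""
    indent = "  " * depth
    # Merge subcategories and articles into a single alphabetical list.
    entries = [(name, True) for name in SUBCATEGORIES.get(category, [])]
    entries += [(name, False) for name in ARTICLES.get(category, [])]
    lines = []
    for name, is_category in sorted(entries):
        if is_category:
            lines.append(f"{indent}[{name}]")
            if depth < max_depth:  # expand contents only down to the limit
                lines.extend(outline(name, max_depth, depth + 1))
        else:
            lines.append(f"{indent}{name}")
    return lines


print("\n".join(outline("Entertainers", max_depth=1)))
```

With max_depth=0 only the top-level list is shown, which corresponds to the outline view without "Show contents of all subcategories"; a user preference could supply a larger value.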

Searching in categories

The search interface could be extended to include the ability to find articles in specific categories as well.

Tools currently available

Semantic MediaWiki

There is a feature in Semantic MediaWiki called Concepts, which solves the problems that Category intersection seeks to solve while extending the concept further.

MediaWiki extension "Multi-Category Search"

The "Multi-Category Search" extension introduces a new special page that allows users to find pages which are included in several specified categories at once. Transclusion of search results is also available.

Magnus Manske category intersection tool

Magnus Manske has written a tool to do category intersections:

  • PetScan, a category intersection tool which allows complex and quickly-computed intersections

Special:Search / list=search API

The Wikipedia search pages, which are based on Elasticsearch, take "incategory" parameters that allow narrowing searches by category. By combining multiple incategory parameters you can intersect categories.

For example:

Let's say you had two categories:
Category:Athletes (track and field) at the 1984 Summer Olympics
Category:French female sprinters
Both groups are a little too big to go through by eye, but the intersection of the two lists would give you a nice concise list of French female sprinters who were at the 1984 games. It's not really worth making a category for this, but intersecting the two categories would be really useful behavior.
incategory:"Athletes (track and field) at the 1984 Summer Olympics" incategory:"French female sprinters"
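The same query can be sent to the list=search API. The sketch below only builds the request URL and does not contact the server; the function name search_api_url and the default limit are assumptions for illustration, while action=query, list=search, srsearch, srlimit and format=json are standard MediaWiki API parameters.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def search_api_url(categories, limit=50):
    """Build a list=search URL with one incategory: term per category."""
    query = " ".join(f'incategory:"{name}"' for name in categories)
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "srlimit": limit,
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


url = search_api_url([
    "Athletes (track and field) at the 1984 Summer Olympics",
    "French female sprinters",
])
print(url)
```

As with the search box, this form of intersection only matches articles that are direct members of both categories.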

For vanilla MediaWiki search, the following works (via https://webapps.stackexchange.com/questions/28412/search-within-a-category-on-a-mediawiki-site):

"[[Category:Athletes (track and field) at the 1984 Summer Olympics]]" "[[Category:French female sprinters]]"

Comments

Please respond on the talk page.

See also