Recommended Digital Tools

JSTOR Text Analyzer

Text Analyzer is a beta tool built by JSTOR Labs. With it, researchers can search for content on JSTOR just by uploading a document.

How it works

  1. Upload a document with text in it. This can be anything: a paper you're writing, an outline of a work in progress, an article you just downloaded, even a picture of a page of your textbook. (Don't worry, we won't store or share the text.)
  2. The tool analyzes the text within the document to find key topics and terms used, and then uses the ones it deems most important — the "prioritized terms" — to find similar content in JSTOR.
  3. Review the results and download any articles you're interested in.
  4. Adjust the results you're seeing by adding, removing or adjusting the importance of the prioritized terms.
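To make steps 2 through 4 concrete, here is a toy sketch of how weighted prioritized terms could drive a ranking. It is not JSTOR's implementation: the function rank_articles, the numeric weights, and the sample articles are all hypothetical and exist only to illustrate why removing or re-weighting a term changes the results.

```python
def rank_articles(prioritized_terms, articles):
    """Rank articles by the summed weight of the prioritized terms they contain.

    prioritized_terms: dict of term -> importance weight (hypothetical scale)
    articles: dict of title -> set of terms describing that article (made up)
    """
    scores = {
        title: sum(w for term, w in prioritized_terms.items() if term in terms)
        for title, terms in articles.items()
    }
    matched = [title for title, score in scores.items() if score > 0]
    # Highest score first; ties broken alphabetically by title.
    return sorted(matched, key=lambda title: (-scores[title], title))


articles = {
    "Soil Health in Urban Plots": {"gardening", "soil", "cities"},
    "Medieval Trade Routes": {"trade", "history"},
    "City Planning and Green Space": {"cities", "planning"},
}

# Initial prioritized terms, as the tool might propose them.
terms = {"gardening": 3, "cities": 1}
first_pass = rank_articles(terms, articles)

# The user removes one term and adds another with a high importance.
del terms["gardening"]
terms["planning"] = 5
second_pass = rank_articles(terms, articles)

print(first_pass)
print(second_pass)
```

In this sketch the article about soil ranks first on the initial pass, and the planning article overtakes it once the terms are adjusted.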

File types supported

You can upload or point to many kinds of text documents, including: csv, doc, docx, gif, htm, html, jpg, jpeg, json, pdf, png, pptx, rtf, tif (tiff), txt, xlsx. If the file type you're using isn't in this list, just cut and paste any amount of text into the search form to analyze it.

Languages supported

English, Arabic, (simplified) Chinese, Dutch, French, German, Hebrew, Italian, Japanese, Korean, Polish, Portuguese, Russian, Spanish and Turkish (see FAQ, below, for details)

Hints & suggestions

  • The more text within your document, the better.
  • Be sure to use the controls to add, remove and adjust the importance of your prioritized terms. Add your own term or phrase if you're not seeing it.
  • The results are created using only the prioritized terms: be sure to add any identified term you want included.
  • If you access Text Analyzer using your phone, a camera icon will appear — use it to take a picture of any page of text and search with that.
  • To run Text Analyzer on the text of a webpage — whether it's a Google Doc or a NY Times article — drag and drop or paste the URL into the search box.
  • Get creative with the kinds of documents you search with: try your class syllabus, the webpage of a news article, or the first paragraph or outline of a paper you're writing.
  • Try searching with non-English-language content if you have it — Text Analyzer can help you find English-language content about the same topics in JSTOR.

Frequently asked questions

Does uploading my paper to Text Analyzer mean that it's now in JSTOR?

Nope. In fact, JSTOR doesn't even store the document you use with Text Analyzer. The tool analyzes the text within the document and extracts the relevant terms without retaining the text itself.

I get some pretty weird recommended topics. What happened? What should I do?

Text Analyzer is still in beta and is, frankly, a machine. It's not perfect. When it recommends strange topics, this can be because there wasn't enough text for it to analyze or because the text contains language (such as an extended metaphor) that "fools" Text Analyzer into thinking it's about something it's not. It can also happen if the topics covered or the language used don't map well to the rest of the content in JSTOR. JSTOR has a wide variety of content and covers many disciplines, but it doesn't have everything.
 
Usually, when Text Analyzer recommends a topic that's not what you're looking for, all you need to do is remove it from the Prioritized Term list. If you're still not seeing what you're looking for, try adding a few terms that are more on-point.

How does Text Analyzer *do* this?

Getting to recommended articles is a multi-step process involving a number of different technologies. First, Text Analyzer extracts the text from the document or image. For Word or PDF documents (for example), it just pulls out the existing text. For images without embedded text (for example, a picture of a page of text), it performs Optical Character Recognition (OCR) to find the text.
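The extraction step can be pictured as a simple dispatch on file type. The sketch below is an assumption for illustration, not the tool's actual logic: the function name extraction_method and the returned labels are hypothetical, and the image extensions are taken from the supported file types listed above.

```python
import os

# Image formats from the supported file type list; these have no embedded text.
IMAGE_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def extraction_method(filename):
    """Return which (hypothetical) extraction path a file would take."""
    extension = os.path.splitext(filename)[1].lower()
    if extension in IMAGE_EXTENSIONS:
        return "ocr"            # picture of a page: recognize the characters
    return "embedded_text"      # Word, PDF, etc.: pull out the existing text


print(extraction_method("textbook_page.JPG"))
print(extraction_method("draft.docx"))
```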
 
Next, the tool analyzes the text to find topics (i.e., subjects) and entities (people, places and organizations) within it. The topics are found by using a "topic model," a tool used in natural language processing. In a topic model, a topic is composed of many individual terms that suggest the topic is being discussed. The higher the density of those terms in the document, the more likely that a particular topic is being discussed. For example: if the terms "carrots," "seed," "harvest," and "backyard" are used a lot, the topic model might suggest that the topic being discussed is "Gardening," even if the term itself is never used. The topic model used in Text Analyzer was created by analyzing all the scholarship in JSTOR. In doing so, we were able to leverage JSTOR Thesaurus, a controlled vocabulary of over 50,000 terms describing the content within JSTOR, for help in both naming the topics and in "training the content model."
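The "density of terms" idea can be illustrated with a few lines of Python. This is a deliberately simplified sketch, not the topic model Text Analyzer uses: real topic models are learned from a corpus, whereas the topics and indicator terms in TOPIC_TERMS below are hand-written and hypothetical, and the scoring is a plain word-count ratio.

```python
import re
from collections import Counter

# Hypothetical topics, each defined by a handful of indicator terms.
TOPIC_TERMS = {
    "Gardening": {"carrots", "seed", "harvest", "backyard"},
    "Astronomy": {"telescope", "orbit", "galaxy", "comet"},
}


def topic_densities(text):
    """For each topic, return the fraction of words that are indicator terms."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return {topic: 0.0 for topic in TOPIC_TERMS}
    counts = Counter(words)
    return {
        topic: sum(counts[term] for term in terms) / len(words)
        for topic, terms in TOPIC_TERMS.items()
    }


def most_likely_topic(text):
    """Return the topic whose indicator terms are densest in the text."""
    densities = topic_densities(text)
    return max(densities, key=densities.get)


sample = "We planted carrots from seed in the backyard and expect a harvest by fall."
print(most_likely_topic(sample))  # prints "Gardening"
```

Note that the word "gardening" never appears in the sample sentence; the topic is suggested purely by the density of its indicator terms.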
 
Entities are identified using multiple entity recognition services and tools, including Alchemy (from IBM), OpenCalais (from Thomson Reuters), the Stanford Named Entity Recognizer, and Apache OpenNLP.
 
Last, the tool uses what it "thinks" are the most relevant topics and entities to find similar content in JSTOR. This similar content is presented on the results screen along with the topics and entities it found.
 
Still want to know more about the technical details? Check out this JSTOR Labs blog post.

Does Text Analyzer work with non-English language content?

It does! The tool can analyze content in a number of different languages (see "Languages supported," above). When you upload, say, a Portuguese document, Text Analyzer uses a language-specific topic model to identify the key terms. (When confronted with a multilingual document, the tool will select the most prevalent language used.) Each topic inferred in a native language is associated with a corresponding topic in English, which the tool then uses to find relevant English-language content.
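As a rough illustration of the two ideas in this answer (choosing the most prevalent language, then mapping native-language topics to English ones), consider the sketch below. Everything in it is assumed for the example: the per-paragraph language tags are taken as already detected, and the TOPIC_TO_ENGLISH table and both function names are hypothetical rather than part of Text Analyzer.

```python
from collections import Counter

# Hypothetical lookup from (language code, native topic) to an English topic.
TOPIC_TO_ENGLISH = {
    ("pt", "jardinagem"): "Gardening",
    ("pt", "astronomia"): "Astronomy",
    ("fr", "jardinage"): "Gardening",
}


def prevalent_language(paragraph_languages):
    """Pick the language code that tags the most paragraphs."""
    return Counter(paragraph_languages).most_common(1)[0][0]


def english_topics(language, native_topics):
    """Translate native-language topics to English, skipping unknown ones."""
    return [
        TOPIC_TO_ENGLISH[(language, topic)]
        for topic in native_topics
        if (language, topic) in TOPIC_TO_ENGLISH
    ]


# A mostly Portuguese document with one French paragraph.
language = prevalent_language(["pt", "pt", "fr"])
print(language)
print(english_topics(language, ["jardinagem", "astronomia"]))
```

The English topics produced this way are what would then be used to search for English-language content.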

Is there an API for Text Analyzer? Is there a way to use it on another set of content?

There is! You can see details about it here: http://labs.jstor.org/api/docs/. Currently the Text Analyzer API is open to beta partners only -- if you're interested in being one, please let us know!