Web-based abstracting and indexing databases (A&Is) are among the main tools scholars use to conduct research. These virtual treasure troves of information seem to “understand” the content that they house and can respond to search queries in a matter of seconds.
Of course, A&Is can’t “read” human text (at least not yet!). Instead, A&Is process content using metadata available in machine-readable markup languages or computer code. Journal publishers that want their articles to show up in relevant databases must make machine-readable article metadata available to them.
If you only publish journal articles in human-readable formats, like PDFs, you’re likely missing out on valuable indexing opportunities. Let’s dig deeper to explore:
- How indexes process information
- Ways to produce machine-readable article files and submit them to indexes
- JATS-compliant XML - the standard indexing format
Finally, we’ll put it all together with an overview of how you can start producing metadata and even full-text articles in machine-readable formats!
Indexes ingest information in machine-readable formats
Let’s take a walk in the proverbial shoes of an academic index, shall we? Indexes are hungry for knowledge! But they can only ingest information given to them in machine-readable formats.
There are two ways to feed hungry indexes:
- Manually entering article metadata into index deposit forms
- Submitting machine-readable article files to indexes
If you don’t produce machine-readable article files, working with discovery services that support manual data entry is your only option. In this case, the form acts as a conduit to convert the article data you enter into machine-readable metadata that the index can understand.
From the outset, the manual approach is limited because not all indexes offer manual data entry. Many indexes, like MEDLINE, will only accept article-level metadata submitted in XML files (more on this later). When indexes do allow manual data entry, it’s a tedious process. And even if publishers can carve out the time and resources, manually entered data is often inadequate, because indexes require rich metadata to process articles in a meaningful way.
The second option, depositing machine-readable article files into indexes, is better for publishers and A&Is. First, it’s a lot faster for publishers because it eliminates the need for manual data entry. Indexes can ingest and “understand” machine-readable article files as they are. Machine-readable article files also result in higher-quality indexing when they contain rich metadata.
Extensible Markup Language, or XML, is the standard markup language used by academic journal indexes. Let’s take a look at the options for producing machine-readable article files and depositing them into A&Is.
Ways to produce machine-readable article files and submit them to indexes
There are two types of machine-readable files that indexes use to process article information: front-matter XML files, which hold article-level metadata, and full-text XML article files. Depending on the index you’re applying to, you may need to produce full-text XML articles, but it’s safe to say that all indexes will require front-matter metadata. Let’s take a look at both types of files and what they should include.
Front-matter XML files contain the front matter of the article but do not include the article’s actual body text. Core front-matter metadata includes:
- Journal title
- Publisher name
- Article title
- Authors’ names
- Article abstract
Front-matter XML files can also include other rich metadata such as authors’ ORCIDs and funder information.
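To make the list above more concrete, here is a minimal sketch of how those core fields could be written out as a front-matter XML file using Python’s standard library. The element names follow common JATS conventions, but the function name, the sample values, and the placeholder ORCID are all assumptions made for illustration. The output is simplified and has not been validated against the JATS schema; a real deposit would also need identifiers such as an ISSN and a DOI, and could carry funder information.

```python
import xml.etree.ElementTree as ET


def build_front_matter(journal_title, publisher_name, article_title, authors, abstract):
    """Return a simplified JATS-style <article> element that holds only front matter."""
    article = ET.Element("article")
    front = ET.SubElement(article, "front")

    # Journal-level metadata: journal title and publisher name.
    journal_meta = ET.SubElement(front, "journal-meta")
    journal_title_group = ET.SubElement(journal_meta, "journal-title-group")
    ET.SubElement(journal_title_group, "journal-title").text = journal_title
    publisher = ET.SubElement(journal_meta, "publisher")
    ET.SubElement(publisher, "publisher-name").text = publisher_name

    # Article-level metadata: title, authors, abstract.
    article_meta = ET.SubElement(front, "article-meta")
    title_group = ET.SubElement(article_meta, "title-group")
    ET.SubElement(title_group, "article-title").text = article_title

    contrib_group = ET.SubElement(article_meta, "contrib-group")
    for author in authors:
        contrib = ET.SubElement(contrib_group, "contrib", {"contrib-type": "author"})
        if author.get("orcid"):  # the ORCID is optional rich metadata
            contrib_id = ET.SubElement(contrib, "contrib-id", {"contrib-id-type": "orcid"})
            contrib_id.text = author["orcid"]
        name = ET.SubElement(contrib, "name")
        ET.SubElement(name, "surname").text = author["surname"]
        ET.SubElement(name, "given-names").text = author["given_names"]

    abstract_el = ET.SubElement(article_meta, "abstract")
    ET.SubElement(abstract_el, "p").text = abstract
    return article  # note: no <body> element, so no article text is included


# Hypothetical sample values, for illustration only.
example = build_front_matter(
    journal_title="Journal of Examples",
    publisher_name="Example Press",
    article_title="An Example Article",
    authors=[
        {"surname": "Doe", "given_names": "Jane",
         "orcid": "https://orcid.org/0000-0000-0000-0000"},  # placeholder ORCID
        {"surname": "Roe", "given_names": "Richard"},
    ],
    abstract="A short abstract.",
)
ET.indent(example)  # pretty-print; requires Python 3.9 or later
print(ET.tostring(example, encoding="unicode"))
```

Because the file stops at the front matter, an index receiving it learns what the article is and who wrote it, but not what the article says.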
As the name suggests, full-text XML article files contain the complete article text in machine-readable language. Both of these formats are superior to manual data entry, but full-text XML is the most robust option because it also allows for text and data mining.
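As a rough illustration of why full-text XML lends itself to text mining, the sketch below pulls the paragraph text out of the body of a small, made-up article. The sample document and the function name are assumptions for this example, and real full-text files are far richer (figures, tables, references, and so on).

```python
import xml.etree.ElementTree as ET

# A tiny, hypothetical full-text article in JATS-style XML.
SAMPLE_FULL_TEXT = """<article>
  <front>
    <article-meta>
      <title-group><article-title>An Example Article</article-title></title-group>
    </article-meta>
  </front>
  <body>
    <sec>
      <title>Introduction</title>
      <p>Indexes ingest <italic>machine-readable</italic> files.</p>
    </sec>
    <sec>
      <title>Methods</title>
      <p>We deposited XML files in batches.</p>
    </sec>
  </body>
</article>"""


def body_paragraphs(xml_text):
    """Return the plain text of every <p> inside <body>, in document order."""
    root = ET.fromstring(xml_text)
    body = root.find("body")
    if body is None:  # a front-matter-only file has no body to mine
        return []
    # itertext() flattens inline markup such as <italic> into plain text.
    return ["".join(p.itertext()).strip() for p in body.iter("p")]


paragraphs = body_paragraphs(SAMPLE_FULL_TEXT)
for text in paragraphs:
    print(text)
```

The same function returns an empty list for a front-matter-only file, which is the practical difference between the two formats: only full-text XML gives a machine the article itself to work with.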
When publishers are ready to deposit either front-matter or full-text XML files into indexes, they can usually do so in one of two ways: uploading article files in batches (usually via an FTP server) or setting up automatic article deposits via an API content deposit feed. API stands for “Application Programming Interface” and is essentially a channel that different software applications can use to communicate with each other.
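For a sense of what a batch upload might look like in practice, here is a minimal sketch using Python’s built-in ftplib module. The function names, and the host, credentials, and folder passed to them, are hypothetical. Each index publishes its own deposit instructions, and some require a secure variant such as SFTP or FTPS instead of plain FTP, so treat this as an outline and not as any particular index’s procedure.

```python
from ftplib import FTP
from pathlib import Path


def upload_batch(ftp, xml_dir):
    """Upload every .xml file in xml_dir over an open FTP connection.

    Returns the list of file names that were sent, in alphabetical order.
    """
    sent = []
    for path in sorted(Path(xml_dir).glob("*.xml")):
        with path.open("rb") as handle:
            # STOR asks the server to store the file under the given name.
            ftp.storbinary(f"STOR {path.name}", handle)
        sent.append(path.name)
    return sent


def deposit(host, user, password, xml_dir):
    """Connect, log in, and upload one batch. All arguments are placeholders."""
    with FTP(host) as ftp:
        ftp.login(user=user, passwd=password)
        return upload_batch(ftp, xml_dir)
```

A call such as deposit("ftp.index.example", "journal-user", "secret", "articles/") would send every XML file in the articles folder; those values are invented for this example. An API deposit feed replaces this manual step with an automated connection between the publishing system and the index.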
JATS XML - the standard indexing format
In conversations and documentation regarding indexing, you’ve likely come across the term “JATS” at some point, and you may be wondering what it means. JATS stands for “Journal Article Tag Suite.” Whereas XML is a general-purpose markup language, JATS is a standardized set of XML elements and attributes for describing journal articles. It is maintained as a standard by the National Information Standards Organization (NISO). JATS is considered the technical standard for journal articles and is preferred or required by many academic indexes, including all National Library of Medicine (NLM) indexes - PubMed, PubMed Central, and MEDLINE.
Formatting your journal articles in JATS XML is a best practice and will enable you to add them to indexes more quickly and easily.
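Because JATS fixes the names and nesting of elements, software can look for each piece of metadata in a predictable place. The sketch below shows the idea with a simple pre-deposit check that reports which core fields are missing from a file. The function name, the list of paths, and the sample document are assumptions for illustration, and this is not schema validation: a file can pass this check and still fail an index’s own validation.

```python
import xml.etree.ElementTree as ET

# Where each core front-matter field usually sits in a JATS article.
CORE_PATHS = {
    "journal title": "front/journal-meta/journal-title-group/journal-title",
    "publisher name": "front/journal-meta/publisher/publisher-name",
    "article title": "front/article-meta/title-group/article-title",
    "author name": "front/article-meta/contrib-group/contrib/name/surname",
    "abstract": "front/article-meta/abstract",
}


def missing_core_metadata(xml_text):
    """Return the labels of core fields that cannot be found in the file."""
    root = ET.fromstring(xml_text)
    return [label for label, path in CORE_PATHS.items() if root.find(path) is None]


# Hypothetical file that lacks a publisher name and an abstract.
INCOMPLETE_SAMPLE = """<article>
  <front>
    <journal-meta>
      <journal-title-group><journal-title>Journal of Examples</journal-title></journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group><article-title>An Example Article</article-title></title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Doe</surname><given-names>Jane</given-names></name>
        </contrib>
      </contrib-group>
    </article-meta>
  </front>
</article>"""

missing = missing_core_metadata(INCOMPLETE_SAMPLE)
print(missing)
```

A check like this only works because every JATS file puts the article title, for instance, in the same element. With ad hoc XML, each publisher’s files would need their own lookup rules.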
Putting it all together and getting started
Now that you know why producing journal articles in machine-readable formats is better for abstracting and indexing, you may be wondering how to get started. As you’ve likely gathered from this post, machine-readable article production is pretty technical.
You may have access to technical staff at your publishing organization who can help with XML article production. If not, you can still get the XML article files you need with the help of a service provider like Scholastica. Scholastica automatically produces front-matter JATS XML files for all journals that use our OA Publishing Platform and full-text JATS XML files for all journals that use our Production Service. You can learn more about how Scholastica is helping OA journal publishers automate indexing steps in this post.