I have actually considered this quite a bit, being both a linguist who studies these things, and a scholar who publishes papers.
Etymologically speaking, the word data is the plural of datum in Latin. In Latin, data would get plural verb agreement. Now, languages borrow words and do whatever they want with them, so this historical fact about data has no relevance in judging what is "correct" in English. There is significant evidence that data has established itself as a mass noun in English, suggesting that, for most people, "data is" is the most natural way to speak.
However, in a university/scholarly paper, I would recommend using "data are", rather than "data is".
The reason: some stickler professors and pedantic scholars believe that if datum is an English word for a single piece of data (which it is), then data must logically be plural. The fact that most people do things differently only means, to them, that most people are doing it wrong. Whether you agree with that or not is somewhat irrelevant.
So you have two choices.
1. If you use "data is", then reasonable people (yes, I am biased) who read your paper will not bat an eye, but stickler professors might judge you on your perceived ignorance or inappropriate level of informality.
2. If you use "data are", then the stickler professors will not judge you to be ignorant, and the reasonable people will think "that's an acceptable variant" or "this person is a stickler for language" (or if they are me, will think "this person is pandering to the sticklers, a necessary evil"), but nobody will think you are ignorant.
So choosing (2), "data are", is clearly your safest bet, and is what I always do (and what I find nearly all of my colleagues do).
In most languages, indefinite articles stem from that language's word for one: for instance, French un, German ein, Italian and Spanish uno, and Portuguese um.
English is no exception: an was derived from one. Note that an was the original indefinite article; the shorter a came later when the final "n" was dropped before consonants.
In some of the languages I mentioned above, the plural form of the indefinite article is simply formed by applying the noun plural inflection: Spanish unos/unas or Portuguese uns/umas.
In others, such as German and Italian, the indefinite article has no plural form. Italian uses the partitive article degli/delle as a substitute, and this is probably also the origin of the French plural form des.
For some reason English did not go through this last step either. To understand why, we need to go back to the way Old English solved the problem.
In Old English, adjectives have a different declension depending on whether the noun they qualify is determined or not.
"The glad man" reads
se glæda guma
whereas "a glad man" is:
glæd guma
As one can see, only the adjective changes.
For one given adjective, you could therefore have different inflections depending on:
- the noun gender (masculine, feminine, neuter)
- the noun being singular or plural
- the four cases (nominative, accusative, genitive, dative)
- whether the reference is definite or indefinite.
So the same adjective would have to follow either the "definite" declension or one of three "indefinite" declensions.
þa glædan guman
Edit
<conjecture>
The theory I'm trying to check (community, please feel free to edit) is that in various languages (Icelandic, a language very close to Old English, or Romanian) the article is added as a suffix to the noun. Then it often "detaches" and moves in front of the noun. Icelandic is halfway through this process for the definite article.
As for the Old English indefinite article, my conjecture is that the process never went through for a number of possible reasons:
- The "loss of inflection" of early Middle English won the race
- The plural of "an" was not easy to evolve at that time (the Romance "-s" plural had not imposed itself yet).
</conjecture>
But the need is still there, just as in any other language where a specific word emerged for the plural indefinite article. This gap is filled by placeholders such as some or a number of.
Most linguists agree that Proto-Indo-European did not use articles.
Latin does not have any kind of article, and Ancient Greek arguably had no indefinite article either; it used something very much like present-day English some (τις, "a certain"). And I believe that Old German did not have any article either.
It is a very remarkable fact that articles appeared in many modern Indo-European languages in a largely mutually independent yet very similar manner. My feeling is that their emergence compensates for the gradual loss of inflection in these languages. But then present-day German is a powerful counterexample...
Best Answer
"While I agree that colloquial use allows for "data" (and perhaps thus "metadata") to be a mass noun (i.e. singular), I am more interested in the case where "data" has already been accepted and used as plural in a particular context."
In that case, you would use metadata in the same way as data: "meta" is simply a prefix, so the core word should receive consistent treatment. Metadata should therefore be treated as plural.