You raise a valid concern. On the one hand, we often talk of periphrastic tenses (and other constructions); on the other, some insist that a tense should be confined to a single word. Others, again, hold that tense is a property of a sentence or clause, not of a word or phrase. Can this problem be solved at all?
The short answer is: there are different models; some models are incompatible with certain other models; and we are free to choose whichever model we prefer. The term periphrastic tense is useful in a model that allows for tenses that consist of more than one word, but not in a model that doesn't. The definition of "tense" is not an objective fact that exists independent of human analysis: it is ultimately a label of convenience created by the observer. Both kinds of models have merit.
Most language users happen to think of will do as the future tense. Some linguists use other models. There is no consensus, not even among linguists, about what constitutes a tense.
Even word boundaries are not objective facts
Perhaps the most fundamental issue you raise is that of word boundaries. What were once considered two separate words may fuse into a single, new word, as in cantare habeo => chanterai. At some point in its development, the status of this phrase-or-word must have been uncertain. This shows how relative the whole terminology is.
In most instances, however, a reasonable case can be made for one or the other, so that the fundamental issue temporarily recedes into the background. It should be noted, though, that what we consider a "word" is to some extent intrinsically subjective and a matter of convention. It is just a convenient demarcation. But let's move on.
Is tense determined by form or by function?
Let me illustrate the problem by means of Latin, where terminology has been fixed for a long time. Tense comes from Latin tempus, "time"; part of the oldest concept of tenses had to do with notions of time. However, there was never a one-to-one correspondence between tenses and temporal references. The pluperfect, for example, is normally used to refer to a time before a narrated time in the past, just as in English; and yet after postquam, "after", the perfect was used, not the pluperfect. Similarly, the imperfect and pluperfect could be used to refer to an hypothetical situation in the present, as in English if I was rich... (although subjunctives were far more common). And so on.
Si domi eram, pater me puniebat. = If at_home I_was, father me punished.
"If I were at home, father would punish me."
Postquam Galliam vidi, vici. = After Gaul I_saw, I_conquered_it.
"After I had seen Gaul, I conquered it."
And yet we still call the verbs in these examples imperfect and perfect, respectively, even though they do not have their usual temporal references. The reason we do this is that the form is named after its most common function, even though it can indeed have other functions. Latin and English both do this, and they are by no means the only languages that do.
Do we then look only at the form of the verb, not at its function, when defining tenses in Latin? No. What we call the passive perfect is periphrastic/analytic/compound, just as in English:
Canis sum. = Dog I_am.
"I am a dog."
Visus sum. = Seen I_am.
"I am/was seen."
You could say this is not a special tense, but two words, one being a past participle, the other a present verb; and yet this is called the passive perfect. The reason is that it functions just as the perfect does, except that it is passive. Here function determines what we call it. This happens in English too when we say I will do it is in the future tense.
Humans like symmetrical systems
So then what constitutes a tense, if we can count neither on form nor on function, at least not reliably so? The answer is probably symmetry. If there is a present active (video "I see"), a present passive (videor, "I am (being) seen"), and a perfect active (vidi "I saw"), we would like there to be a perfect passive. Because there was no such verbal form, a phrase was made to serve as its equivalent (visus sum "I was/am seen in the past"). We humans like our systems neat and symmetrical if possible:
             Active     Passive
Present      video      videor
Imperfect    videbam    videbar
Perfect      vidi       [visus sum]
Future       videbo     videbor
Now is this label "passive perfect" merely a convention? It may have been once, but, as people start believing in it, they start using it in ways that neatly fit the system, even if the meaning of visus sum was once somewhat different. It is in some ways a self-fulfilling prophecy. Whenever a sentence in the active perfect was to be passivized, instead of saying "oh, I can't do that", people started thinking, "this is the passive perfect; I will use it". The same applies to I will do it in English.
All three approaches have upsides and downsides
Is this a perfect system of terminology? No. There are serious disadvantages. But it has been in use for a long while, and most people think of "I will do it" as fitting within a neat system of past, present, and future, because that is the most convenient and obvious partition of our verb tenses, or so we feel.
Various branches of linguistics have proposed different systems and different terminologies in the past. This is a productive and beneficial approach. Some chose to focus on form and consider the English periphrastic future not a tense at all; they will only count affixes and endings as capable of forming tenses. This system certainly has merit.
Others have emphasised function; they have gone so far as to declare that, since many forms can be used for more than one function, as with si eram... / "if I was...", only forgoing form altogether leads to a consistent approach. Hence they treat tense as a property of a clause or sentence, not of a word or phrase. That way, only combined with a word like yesterday does was acquire a past tense; in if I was at work today, you wouldn't see me here, it is a present tense, because it refers to a situation in the present, be it an hypothetical one. This approach, too, has merit.
One could use several systems at once
As an alternative, we could invent new words for these two new approaches, such as single-word tenses for the English simple present and simple past, and time-reference or temporality for the time-reference of a clause or sentence. Many different models are possible. Insisting on one model without considering the benefits of other models seems unwise. And saying "x is A" when you mean "I find the model in which x is called A most useful" is a simplification.
Suppletion as an illustration of a convenient choice
Some systems are uncontested, even though at some point in the past a fairly arbitrary choice must have been made.
I go.
I went.
Do these two forms belong to the same verb? Yes, you will say, because that is what you were taught, and because they "feel" like the same verb, just with odd forms. But, in the past, there were two verbs, both meaning something like going (although there were no doubt some differences between them). At some point the present form of a verb resembling go was taken, its past forms were discarded (or not, if such never existed), and the past form of a verb resembling went was adopted in their place.
We could say, "there are two defective verbs in modern English, one lacking a past form, the other a present form"; but we choose not to do so. That is to some degree arbitrary, but in this case it is just very convenient. If certain linguists would prefer to treat them as two different verbs, then let them do so, if this is somehow more convenient in a certain linguistic analysis. Or they could just say "this verb consists of two different roots", as they no doubt do.
Best Answer
I would say that, rather than being a matter of either grammar or inflection, the use of -t- or -tt- is just a convention about the spelling of these words. The pronunciation of the double -tt- in "omitted" doesn't contrast with the pronunciation of the single -t- in a word like "literally" or "mitigation". The usual phonemic transcription of a word like "omitted" would only include one /t/. The inflectional suffix is the same in both vomiting and emitting: it is -ing.
As outlined in the answers to "Focussed" or "focused"? Rules for doubling the last consonant when adding -ed and various other questions on this site, the use of double consonant letters before certain vowel-initial suffixes in English spelling is very closely correlated with the stress pattern of a word (and also with the type of vowel sound preceding the consonant).
Many English verbs that have forms spelled with -tt- can in fact be traced back to etymological sources with phonetically long /tː/ (in Old English, in some other Germanic language, or in Latin), but I would say that the relationship is only indirect: we see this correlation because a vowel before a historically single -t- was often either unstressed, or lengthened if stressed. For example, the verb hate historically had a short /a/, but this was lengthened in Middle English. The set of lengthening changes like this presumably contributed to the modern English convention of marking short vowels by doubling a following consonant letter.
However, this kind of lengthening was not so regular for the vowels i or u. It seems that the verb put did not actually originally have /tː/ (the OED says the corresponding OE forms are probably something like *pūtian, *putian, pȳtan, potian). Similarly, the verb nut is derived from the noun, which corresponds to an Old English form with singleton /t/, like hnutu. Despite the etymologies of these words, putting and nutting are spelled with -tt- in present-day English.
The verb vet is ultimately derived from Latin veterīnārius, with singleton /t/, but we write vetting and vetted.
If "emitting" does phonologically contain double tt, then "emit" probably does also
Some linguists have considered the possibility that the English sound system might contain some very abstract elements and processes: for example,
a process that shortens underlying aː or lengthens underlying a to produce the surface alternation between æ and eɪ in certain words like sane/sanity;
a word-final vowel that has zero as its surface realization (supposedly present in words like ellipse);
a process that turns underlying ng into ŋ in certain contexts (e.g. in word-final position).
In theories like this, I have seen reference to the concept of "virtual geminate" consonants that are supposed to explain the otherwise exceptional pronunciations of certain words; I think that, for example, the final stress of words like omit, emit, permit could be interpreted as a sign that they "underlyingly" end in tt even in the present-day sound system of English. But I don't think many people actually believe that theories like this are true. Even if they were true, this tt would be part of the root: words ending in tt would take the same inflectional suffixes as words ending in single t.