we can identify is equal to M minus W plus one.

The reason for that is that basically when they reach

the end of the pattern,

the ones that basically reach

out do not count as the w-grams which means that they don't exist.

So, that we can have,

if basically the length is M and the window length is W,

there are at most M minus W plus one w-grams of query string M. Now, remember,

in this problem, we are trying to see if there's a match between P

and Q with at most K mismatches.

So, let's basically assume that there is a mismatch

somewhere here on the query string Q, there's a mismatch.

Now, what you will first notice is that this mismatch will be

aligned with multiple w-grams,

but it is not going to be aligned with any w-gram that finishes before that,

and it is not going to be aligned with any w-gram that starts after that position.

So, in general, it is easy to see that each mismatch will affect.

So, this is a mismatch.

This is a window that's affected.

The number of windows that are affected

by a given mismatch will be W. This

is easy because essentially the mismatch if this is my window W,

if the length is W, say W is equal to five,

the mismatch may happen at the first position,

the mismatch may happen at the second position,

the mismatch may happen at the third position,

the mismatch may happen at the fourth position,

the mismatch may happen at the fifth position.

So, each mismatch essentially can affect if my W is equal to five,

the number of w-grams that may be affected is also five.

So, this means that each mismatch can

affect five if W is equal to five can mismatch affect five w-grams or more.

Generally, each mismatch can effect at most w-grams.

But remember, I am allowing for at most k mismatches.

This means that the maximum number of w-grams that contain a mismatch should be KW.

So this is the number of out off

the M minus W plus one w-grams that I have in my query string.

The number of w-grams that may contain a mismatch is KW.

So, what does this mean?

This means that, a total of M minus W plus one

minus K times one w-grams must have exact match.

They must have an exact match in string P,

because my mismatch cannot affect those.

The K mismatches cannot effect more than this many w-grams.

This means that,

if I identify M minus W plus one w-grams,

and given an upper bound of K errors,

if I check W minus M minus W

plus one minus K times W w-grams in P,

and if I find that many,

at least that many w-grams,

then there's a chance that these two strings may contain an approximate match.

If however I cannot find this many w-grams,

then what I can tell you is that,

we are guaranteed that

these two strings do not contain an approximate match with the given error bound.

So, we can prune this away.

Note that again, I didn't compute any distance between the two strings,

I simply divided my query string into it's w-grams,

I search for this w-grams in my P string,

and if I can find at least this many w-grams with

the exact match then I know that there is a chance these might basically be similar.

So, I should go and basically pay the expensive distance cost.

Note that, the good news is that we will not learn about data algorithm now,

but since we are talking about sub sequence match,

we are trying to find w-grams of Q in P that they are doing sub sequence match in

P by using suffix trees that are

efficient algorithms to implement this w-gram counting algorithm.

We are not going to learn that,

but there are very efficient ways to search for the w-grams of Q in the P string.

They can be very long,

but the search is going to be linear time rather than being quadratic.

That is a big gain and that is very important in data exploration.

We need to make sure that these searches can be done

quickly so that the user can basically be provided

with the data for export purposes in near real-time.

Quadratic is not acceptable, linear is okay.