Find True Search Query in Logs with Splunk

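Some applications log the search query on every keystroke, so a single search leaves a run of partial queries in the log. A user typing "splunk", for example, may produce entries for "s", "sp", "spl" and so on, ending with the full query.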

These “overlapped” search queries make analysis or summary difficult, because every partial query is counted alongside the one the user actually meant.

Here is a way to keep only the true search query:

Out-Of-Box filtering

| `ut_shannon(query)`
| `ut_bayesian(query)`
| `ut_meaning(query)`
| where ut_bayesian<.9959 AND ut_shannon!=0 AND ut_meaning_ratio!=0

Each condition in the where clause drops one kind of noise. ut_shannon!=0 removes single letters. ut_meaning_ratio!=0 removes word and digit combinations that don’t mean much. ut_bayesian<.9959 removes strings that look like DGA (domain generation algorithm) output, which is useful for dropping random, one-off IDs.
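To see why a zero Shannon score picks out single letters, here is a minimal Python sketch of character-level Shannon entropy. It is an illustration outside Splunk, not the macro's implementation: it assumes ut_shannon scores the characters of the string, and the function name shannon_entropy is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits (assumed model of ut_shannon)."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # Sum -p * log2(p) over each distinct character's relative frequency p.
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    for sample in ["s", "sp", "splunk"]:
        print(sample, shannon_entropy(sample))
```

Under that assumption, a string made of one repeated character, such as "aaaa", also scores 0 and would be filtered out along with the single letters.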

Catch

| stats values(query) as query values(http_uri) as http_uri values(index) AS idx values(sourcetype) AS st min(_time) as event_time by user timespan
| sort - timespan
| mvexpand query
| sort - timespan user query
| streamstats current=f last(query) as last_query count by timespan user
| eval flag=if(NOT match(last_query+"*",query),1,0)
| where flag=1
| stats values(query) as query dc(query) AS dc_query values(http_uri) as http_uri values(idx) AS idx values(st) AS st min(event_time) as event_time by user
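The search works as follows:
  1. Group the events by timespan and user
  2. Sort the result by timespan
  3. Expand the multivalue query field into one row per query, then sort again by timespan, user and query so that each partial query sits next to the longer query it belongs to
  4. Create a new field “last_query” that holds the query from the preceding row, which is the “future” (later typed) value of the current query
  5. If the future value contains the current query, set the flag to 0
  6. If the future value does not contain the current query, set the flag to 1
  7. Keep only the rows with flag 1
  8. Group the result again by user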
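The core idea of the steps above, dropping every query that is just the beginning of a longer one, can be sketched in Python for a single user and timespan. This is an illustration outside Splunk, not a translation of the search: the function name keep_final_queries is hypothetical, it uses a plain prefix test where the search uses a regex match, and it always keeps the first row, which has no predecessor to compare against.

```python
def keep_final_queries(queries):
    """Drop every query that is a prefix of the row before it in descending order."""
    kept = []
    previous = None
    # Descending sort puts a longer query directly before its own prefixes.
    for query in sorted(set(queries), reverse=True):
        # Keep the row unless the preceding (longer) query starts with it.
        if previous is None or not previous.startswith(query):
            kept.append(query)
        previous = query
    return sorted(kept)


if __name__ == "__main__":
    typed = ["s", "sp", "spl", "splunk", "a", "ap", "app", "apple"]
    print(keep_final_queries(typed))
```

Comparing each row only with its immediate predecessor is enough here because, in sorted order, all strings that share a given prefix sit next to each other.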