Overview
While updating the search application for the Cultural Japan project, I needed to aggregate multilingual data in Elasticsearch. This article is a memo of the methods I investigated.
Data
As sample data, assume an agential field (representing a person) whose entries each have id, ja, and en values.
{
  "agential": [
    {
      "ja": "葛飾北斎",
      "en": "Katsushika, Hokusai",
      "id": "chname:葛飾北斎"
    }
  ]
}
For this data, we want to filter by id while displaying the ja or en value according to the language setting.
Ideally, the aggregation result would look like the following.
(When ja is specified)
{
  "buckets": [
    {
      "key": "葛飾北斎",
      "id": "chname:葛飾北斎",
      "doc_count": 1
    }
  ]
}
(When en is specified)
{
  "buckets": [
    {
      "key": "Katsushika, Hokusai",
      "id": "chname:葛飾北斎",
      "doc_count": 1
    }
  ]
}
Method 1: Using Nested Aggregation
Following the forum thread below, let’s try a nested aggregation.
https://discuss.elastic.co/t/aggregeations-with-different-keys-and-values-label-and-id/274218
DELETE test

PUT test
{
  "mappings": {
    "properties": {
      "agential": {
        "type": "nested",
        "properties": {
          "id": {
            "type": "keyword"
          },
          "ja": {
            "type": "keyword"
          },
          "en": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

PUT test/_doc/1
{
  "agential": [
    {
      "ja": "葛飾北斎",
      "en": "Katsushika, Hokusai",
      "id": "chname:葛飾北斎"
    }
  ]
}

GET test/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "nested": {
            "path": "agential",
            "query": {
              "bool": {
                "filter": [
                  {
                    "term": {
                      "agential.id": "chname:葛飾北斎"
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  },
  "_source": [
    "agential"
  ],
  "aggs": {
    "agential": {
      "nested": {
        "path": "agential"
      },
      "aggs": {
        "id": {
          "terms": {
            "field": "agential.id"
          }
        },
        "label": {
          "terms": {
            "field": "agential.ja"
          }
        }
      }
    }
  }
}
In this case, the following result is returned.
{
  "took" : 9,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.0,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.0,
        "_source" : {
          "agential" : [
            {
              "ja" : "葛飾北斎",
              "en" : "Katsushika, Hokusai",
              "id" : "chname:葛飾北斎"
            }
          ]
        }
      }
    ]
  },
  "aggregations" : {
    "agential" : {
      "doc_count" : 1,
      "label" : {
        "doc_count_error_upper_bound" : 0,
        "sum_other_doc_count" : 0,
        "buckets" : [
          {
            "key" : "葛飾北斎",
            "doc_count" : 1
          }
        ]
      },
      "id" : {
        "doc_count_error_upper_bound" : 0,
        "sum_other_doc_count" : 0,
        "buckets" : [
          {
            "key" : "chname:葛飾北斎",
            "doc_count" : 1
          }
        ]
      }
    }
  }
}
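Both label and id are returned under aggregations.agential, but as two separate bucket lists. Nothing in the response ties a label bucket to its id bucket, so the result is verbose and the pairing still has to be resolved elsewhere.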
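As a possible variation on the query above, the label terms aggregation could be placed under the id terms aggregation, so that each id bucket carries its own label. The following Python sketch builds such an aggs body and flattens the response into the desired bucket shape. The helper names build_aggs and flatten_buckets are hypothetical, the approach assumes each id has a single label per language, and I have not verified it against the Cultural Japan index.

```python
from typing import Any, Dict, List


def build_aggs(lang: str) -> Dict[str, Any]:
    """Build an aggs body in which each id bucket carries its own label bucket."""
    return {
        "agential": {
            "nested": {"path": "agential"},
            "aggs": {
                "id": {
                    "terms": {"field": "agential.id"},
                    "aggs": {
                        # Assumption: one label per id, so a single bucket is enough.
                        "label": {"terms": {"field": f"agential.{lang}", "size": 1}}
                    },
                }
            },
        }
    }


def flatten_buckets(aggregations: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn the id buckets and their label sub-buckets into {key, id, doc_count}."""
    result = []
    for id_bucket in aggregations["agential"]["id"]["buckets"]:
        label_buckets = id_bucket["label"]["buckets"]
        # Fall back to the id when no label exists in the requested language.
        label = label_buckets[0]["key"] if label_buckets else id_bucket["key"]
        result.append(
            {"key": label, "id": id_bucket["key"], "doc_count": id_bucket["doc_count"]}
        )
    return result
```

Note that inside a nested aggregation, doc_count counts nested agential objects rather than parent documents.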
Method 2: String Concatenation
Next, let’s try the approach described in the question of the following Stack Overflow post.
https://stackoverflow.com/questions/70545830/aggregation-that-returns-the-users-name-and-id
I added an fc-agential field and stored in it the id, ja, and en values concatenated with $$$ as a delimiter.
DELETE test

PUT test
{
  "mappings": {
    "properties": {
      "agential": {
        "properties": {
          "id": {
            "type": "keyword"
          },
          "ja": {
            "type": "keyword"
          },
          "en": {
            "type": "keyword"
          }
        }
      },
      "fc-agential": {
        "type": "keyword"
      }
    }
  }
}

PUT test/_doc/1
{
  "agential": [
    {
      "ja": "葛飾北斎",
      "en": "Katsushika, Hokusai",
      "id": "chname:葛飾北斎"
    }
  ],
  "fc-agential": [
    "chname:葛飾北斎$$$葛飾北斎$$$Katsushika, Hokusai"
  ]
}

GET test/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "agential.id": "chname:葛飾北斎"
          }
        }
      ]
    }
  },
  "_source": [
    "agential"
  ],
  "aggs": {
    "agential": {
      "terms": {
        "field": "fc-agential"
      }
    }
  }
}
In this case, the following result is returned.
{
  "took" : 964,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.0,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.0,
        "_source" : {
          "agential" : [
            {
              "ja" : "葛飾北斎",
              "en" : "Katsushika, Hokusai",
              "id" : "chname:葛飾北斎"
            }
          ]
        }
      }
    ]
  },
  "aggregations" : {
    "agential" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "chname:葛飾北斎$$$葛飾北斎$$$Katsushika, Hokusai",
          "doc_count" : 1
        }
      ]
    }
  }
}
Because nested is not used, the query and the result are simpler, but each bucket key has to be split back into id, ja, and en after retrieval.
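The concatenation at index time and the splitting after retrieval could look roughly like the following Python sketch. The function names join_facet and split_buckets are hypothetical, and the code assumes that neither the id nor the ja value ever contains the $$$ delimiter.

```python
from typing import Any, Dict, List

DELIMITER = "$$$"  # same delimiter as in the fc-agential example


def join_facet(agent: Dict[str, str]) -> str:
    """Build the concatenated value stored in fc-agential (order: id, ja, en)."""
    return DELIMITER.join([agent["id"], agent["ja"], agent["en"]])


def split_buckets(buckets: List[Dict[str, Any]], lang: str) -> List[Dict[str, Any]]:
    """Split each concatenated bucket key and keep only the label for `lang`."""
    result = []
    for bucket in buckets:
        # maxsplit=2 leaves any further "$$$" inside the last part (en) untouched.
        agent_id, ja, en = bucket["key"].split(DELIMITER, 2)
        label = {"ja": ja, "en": en}[lang]
        result.append({"key": label, "id": agent_id, "doc_count": bucket["doc_count"]})
    return result
```

Passing the buckets from aggregations.agential together with "ja" or "en" yields entries in the key, id, doc_count shape described at the beginning of this article.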
Method 3: Top-Hits Sub-Aggregation
Let’s try the approach given in an answer to the same Stack Overflow question.
https://stackoverflow.com/questions/70545830/aggregation-that-returns-the-users-name-and-id
DELETE test

PUT test
{
  "mappings": {
    "properties": {
      "agential": {
        "properties": {
          "id": {
            "type": "keyword"
          },
          "ja": {
            "type": "keyword"
          },
          "en": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

PUT test/_doc/1
{
  "agential": [
    {
      "ja": "葛飾北斎",
      "en": "Katsushika, Hokusai",
      "id": "chname:葛飾北斎"
    }
  ]
}

GET test/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "agential.id": "chname:葛飾北斎"
          }
        }
      ]
    }
  },
  "_source": [
    "agential"
  ],
  "aggs": {
    "agential": {
      "terms": {
        "field": "agential.ja"
      },
      "aggs": {
        "doc": {
          "top_hits": {
            "size": 1
          }
        }
      }
    }
  }
}
In this case, the following result is returned.
{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.0,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.0,
        "_source" : {
          "agential" : [
            {
              "ja" : "葛飾北斎",
              "en" : "Katsushika, Hokusai",
              "id" : "chname:葛飾北斎"
            }
          ]
        }
      }
    ]
  },
  "aggregations" : {
    "agential" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "葛飾北斎",
          "doc_count" : 1,
          "doc" : {
            "hits" : {
              "total" : {
                "value" : 1,
                "relation" : "eq"
              },
              "max_score" : 0.0,
              "hits" : [
                {
                  "_index" : "test",
                  "_type" : "_doc",
                  "_id" : "1",
                  "_score" : 0.0,
                  "_source" : {
                    "agential" : [
                      {
                        "ja" : "葛飾北斎",
                        "en" : "Katsushika, Hokusai",
                        "id" : "chname:葛飾北斎"
                      }
                    ]
                  }
                }
              ]
            }
          }
        }
      ]
    }
  }
}
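Each bucket in aggregations.agential includes a sample hit, so the correspondence between id and ja can be read from its _source. The hit contains every agential entry of the document, so the entry whose ja matches the bucket key still has to be picked out. Compared to Method 2, the advantage is that no extra field such as fc-agential is needed, but the response is again verbose, since every bucket carries a full document.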
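Reading the id out of the sample hit could be sketched in Python as follows. The function name pair_from_top_hits is hypothetical, and the code assumes the sub-aggregation is named doc and that _source includes the agential field, as in the request above.

```python
from typing import Any, Dict, List


def pair_from_top_hits(buckets: List[Dict[str, Any]], lang: str) -> List[Dict[str, Any]]:
    """Look up the id for each label bucket in its top_hits sample document."""
    result = []
    for bucket in buckets:
        source = bucket["doc"]["hits"]["hits"][0]["_source"]
        # The sample hit holds every agential entry of the document,
        # so pick the one whose label matches the bucket key.
        match = next(
            (a for a in source["agential"] if a.get(lang) == bucket["key"]), None
        )
        result.append(
            {
                "key": bucket["key"],
                "id": match["id"] if match else None,
                "doc_count": bucket["doc_count"],
            }
        )
    return result
```

The lang argument must match the field used in the terms aggregation (agential.ja in the request above); otherwise no entry matches and id is left as None.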
Summary
This article summarized three ways to build Elasticsearch aggregations whose display keys (labels) differ from their filter values (IDs). I hope it serves as a helpful reference.
There may well be better methods than the three presented here. If you know of one, I would appreciate it if you could share it.