Impressive, bookmarked, upvoted.
Appreciate the demo: https://deephn.org/?q=apple+silicon
Seems to be buggy. According to the SeekStorm GitHub, it's supposed to support boolean operators, right? But they don't seem to work.
Eg: https://deephn.org/?q=Linux+OR+KDE
Counter-example: https://deephn.org/?q=embeddings Contrast with https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...
It is fast, but nowhere close to accurate or useful for this specific example. I could not find a way to force the plural form; neither quotes nor a plus worked.
Is there distributed server support? I see it on the list of new features with (currently PoC) next to it, but is the code for the PoC available anywhere?
Also, would there be any potential issues if the index was mounted on shared storage between multiple instances?
The code for the distributed search cluster is not yet stable enough to be published, but it will be released as open-source as well.
As for shared storage, do you mean something like NAS or rather Amazon S3? Cloud-native support for object storage and separating storage and compute is on our roadmap. The challenges will be maintaining latency and the need for more sophisticated caching.
S3 support would be absolutely killer.
I really like your approach. Impressed by your care for performance and your fast pace of adding what appears to be pretty complex stuff, while making sure it stays performant.
Keep it up!
Bookmarked.
What is the story for a multi-language corpus? Do I have to do my own stop-word pruning, tokenizing, lemmatizing, etc.? This is usually the case with full-text search solutions, and it is a pain.
Re: stemming and lemmatizing, I just want to plug the most impressive NLP stack I ever used, ChatScript. It's really for building dialog trees, where it walks down a branch of conversation using what are effectively switch statements, but with really rich conceptual pattern matching and capturing. Somewhere in the middle of the stack it does an excellent job of abstracting from word input to general concepts (in WordNet), performing all the spell correction (according to your dictionary), stemming, lemmatization, and disambiguation.
I've had it in mind for a while to build a fuzzy search tool based on parsing each phrase into concepts, parsing the search query into concepts, and finding the nearest match based on that. It's a C library and very fast.
https://github.com/ChatScript/ChatScript
Looks like it hasn't been committed to in some time, I'll have to check out their blog and see what's up. I guess with the advent of LLMs, dialog trees are passé.
Their company home page, http://brilligunderstanding.com/ wow..
We started by making the core search technology faster. Then we added a Unicode character folding/normalization tokenizer (diacritics, accents, umlauts, bold, italic, full-width chars...). Last week we added a tokenizer that supports Chinese word segmentation. Currently, we are working on a multi-language tokenizer that segments Chinese, Japanese, and Korean without switching the tokenizer.
I hope the folding and normalization is configurable by language. I really hate it when some search decides that a and ä are the same letter. In Finnish they really aren't; "saari" is an island, "sääri" is the lower leg or shin.
Currently, you can choose between tokenizers with or without folding. But configurability per language or full customizability of the folding logic by the user is a good idea.
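To make the folding trade-off concrete, here is a tiny illustrative Rust sketch (not SeekStorm's actual tokenizer; the folding table and function name are made up for the example):

```rust
// Illustrative only: a naive character-folding step like the one discussed above.
// Folding ä -> a makes "sääri" (shin) indistinguishable from "saari" (island).
fn fold_chars(input: &str) -> String {
    input
        .chars()
        .map(|c| match c {
            'ä' | 'å' | 'à' | 'á' | 'â' => 'a',
            'ö' | 'ò' | 'ó' | 'ô' => 'o',
            'ü' | 'ù' | 'ú' | 'û' => 'u',
            other => other,
        })
        .collect()
}

fn main() {
    assert_eq!(fold_chars("saari"), "saari");
    assert_eq!(fold_chars("sääri"), "saari"); // both terms now collide in the index
    println!("folded: {}", fold_chars("sääri"));
}
```

A per-language configuration would simply skip or swap out such a folding table for languages like Finnish, where the umlaut distinction carries meaning.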
This is really impressive. I would suggest benchmarking it against Vespa as well; I have gotten better perf results from Vespa than from Lucene/Solr/ES.
I'll give it a try this weekend as well.
Demo = impressed.
How's SeekStorm's prowess in mid-cap enterprise? How hairy is the ingest pipeline for sources like decade-old SharePoint sites, PDFs with partial text layers, Excel, email .msg files, etc.?
Yes, integration in complex legacy systems is always challenging. As a small startup, we are concentrating on core search technology to make search faster and to make the most of available server infrastructure. As SeekStorm is open-source, system integrators can take it from there.
Same as any other full-text search solution - it's your job to integrate it.
>Demo = impressed.
How did you demo it? Did you spin up your own instance and index the Wikipedia corpus like the docs suggest? I'd like to just give it a whirl on an already running instance.
Never mind, found that someone posted a link already.
On that topic, can anybody chime in on state-of-the-art PDF OCR? Even if that's a multimodal LLM. I've used ChatGPT to extract tabular data from images, but I need something I can self-host for proprietary data.
Azure Document Intelligence (especially with the layout model[0]) is really good. It has both JSON and MD output modes and does a pretty solid job identifying headers, sections, tables, etc.
What's interesting is that they have a self-deployable container model[1] that only phones home for billing so you can self-host the runtime and model.
[0] https://learn.microsoft.com/en-us/azure/ai-services/document...
[1] https://learn.microsoft.com/en-us/azure/ai-services/document...
Peculiar, Thanks!
How is it different from Meilisearch[1]? I’m running search for my small multi tenant SaaS and self hosted Meilisearch gives me grief like any relatively new tech, so I’m shopping for new solutions.
1: https://www.meilisearch.com/
Well, off the bat, this seems to be able to be embedded directly into your Rust project without the need for a standalone server.
Could you share more about your experience with Meilisearch?
Tl;dr: 4/5 stars for hobbit software SaaS.
—————
Full version: I run it on a dedicated 2 vCPU / 2 GB machine on DigitalOcean. Every tenant has an index, and I have around 30k searches per week across all tenants. Each tenant has from 1 to 150k documents in their index. Sentry catches a MeilisearchTimeoutException a couple of times every day with the message that Meilisearch could not finish adding a document to the index. I don't care too much about that, because a background worker is responsible for updating the index, so the task just gets rescheduled. I like to keep my Sentry clean, so it's more of an inconvenience than an issue.

Meilisearch setup is very straightforward: they provide client libraries for almost all languages (maybe even for esoteric and marginal ones, idk, I only need Python), have pretty decent documentation covering the basics, and don't really require operations at my scale. I really liked the feature of issuing limited-access tokens that let you set a precondition on searches. That's how I limit the searches for a particular user on a tenant to see only their data.
Sub-millisecond latency sounds impressive, but isn't network latency going to overshadow these gains in most real-world scenarios?
When search is cheap and quick, it's possible to improve search by postprocessing search results and running more queries when necessary.
I use Tantivy, and add refinements like: if the top result is objectively a low-quality one, it's usually a query with a typo finding a document with the same typo, so I run the query again with fuzzy spelling. If all the top results have the same tag (that isn't in the query), then I mix in results from another search with the most common tag excluded. If the query is a word that has multiple meanings, I can ensure that each meaning is represented in the top results.
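As a rough sketch of the retry-on-low-quality pattern described above (this is neither Tantivy's nor SeekStorm's API; run_query, run_fuzzy_query, the Hit type, and the score threshold are hypothetical placeholders):

```rust
// Hypothetical result type and search calls; only the control flow is the point.
#[allow(dead_code)]
struct Hit {
    doc_id: u64,
    score: f32,
}

fn run_query(_query: &str) -> Vec<Hit> {
    // ... exact-match search against the index (placeholder) ...
    unimplemented!()
}

fn run_fuzzy_query(_query: &str) -> Vec<Hit> {
    // ... same search, but with typo-tolerant term matching (placeholder) ...
    unimplemented!()
}

/// Cheap queries make this kind of postprocessing affordable:
/// if the best hit looks poor, assume a typo and retry with fuzzy matching.
fn search_with_refinement(query: &str, min_acceptable_score: f32) -> Vec<Hit> {
    let hits = run_query(query);
    let top_is_good = hits
        .first()
        .map_or(false, |top| top.score >= min_acceptable_score);
    if top_is_good {
        hits
    } else {
        // A low-quality top hit often means a typo in the query matched the
        // same typo in a document, so rerun the query with fuzzy spelling.
        run_fuzzy_query(query)
    }
}
```

The tag-diversification and multi-meaning refinements follow the same shape: run one or two extra queries with modified filters and merge the result lists.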
It depends on the application.
When using SeekStorm as a server, keeping the latency per query low increases the throughput and the number of parallel queries a server can handle on given hardware. An efficient search server can reduce the required investment in server hardware.
In other cases, only the local search performance matters, e.g., for data mining or RAG.
Also, it's not only about averages but also about tail latencies. While network latencies dominate the average search time, that is not the case for tail latencies, which in turn heavily influence user satisfaction and revenue in online shopping.
A typical server is serving more than one request at a time, hopefully.
Very impressive results. I'm curious how you benchmarked against BM25 in terms of accuracy? I couldn't find metrics around that, just one search example. I think there are use cases where latency is king, but when it comes to vector search / hybrid search, accuracy is probably more important.
For the latency benchmarks we used vanilla BM25 (SimilarityType::Bm25f for a single field) for comparability, so there are no differences in terms of accuracy.
For SimilarityType::Bm25fProximity, which takes into account the proximity between query term matches within the document, we have so far only anecdotal evidence that it returns significantly more relevant results for many queries.
Systematic relevancy benchmarks like BEIR and MS MARCO are planned.
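For context, the "vanilla" BM25 mentioned above is the standard Okapi BM25 scoring function, which ranks a document D against a query Q = q_1, ..., q_n as:

```latex
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
```

Here f(q_i, D) is the frequency of query term q_i in D, |D| is the document length, avgdl is the average document length in the corpus, and k_1 and b are free parameters (commonly around 1.2 and 0.75). BM25F generalizes this to weighted term frequencies across multiple fields; the proximity variant mentioned above additionally rewards query terms that occur close to each other in the document.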
Interesting approach, would love to see a comparison with Typesense
The documentation seems a bit sparse. Also, I couldn't find binaries, so I'm guessing building from source is required at the moment?
I'm curious about the binary size of it all. Could this be compiled to WASM and run on static pages?
The SeekStorm library is 9 MB, and the SeekStorm server executable is 8 MB, depending on the features selected in Cargo.
You add the library to your project via 'cargo add seekstorm'; your project has to be compiled anyway.
As for the server, we may add binaries for the main OS in the future.
WASM and Python bindings are on our roadmap.
It feels like everyone re-implements the same application. Searching text in language x.y.z has been done a million times, and search speed is not a problem, so what differentiates this solution from the dozen+ mature ones?
The speed looks great but isn't everything else already fast enough?
It's not just about speed. Speed reflects efficiency. Efficiency is needed to serve more queries in parallel and to search within exponentially growing data, with less expensive hardware, fewer servers, and less energy. Therefore the pursuit of efficiency never gets outdated and has no limit.
In addition to what you said, faster searches can also enable different search options. For example, if you can execute five similar searches in the time it would take to execute one, you now have the option to ask, "Can I leverage five similar searches to produce better results?" If the answer is yes, you can provide better answers and still keep the same user experience.
Where I really think faster searches will come into play is with AI. There is nothing energy-efficient about how LLMs work, and I really think enterprises will focus on using LLMs to generate as many Q&A pairs as possible during off-peak energy hours, then using a hybrid search that can bridge semantic (vector) and text search. I think for enterprises the risk of hallucinations (even with RAG) will be too great, and they will fall back to traditional search, but with a better user experience.
Based on the README, it looks like vector search is not supported or planned, but it would be interesting to see if SeekStorm can do this more efficiently than Lucene/OpenSearch and others. I've only dabbled in the search space, so I don't know how complex this would be, but I think SeekStorm could become a killer search solution if it can support both.
Edit: My bad, it looks like vector search is PoC.
Software is currently extremely inefficient, driven by years of increasingly powerful, cheap hardware. Once that starts to slow down, it makes sense that we start squeezing efficiency out of software again. We've also seen in the last 20 years the rise of languages that make writing performant, higher-level software a lot easier.
We’re also at a point where cloud compute is consuming a significant amount of energy globally.
I'm not sure it's a good idea to use mmap for this.
https://db.cs.cmu.edu/mmap-cidr2022/
In SeekStorm you can choose per index whether to use mmap or let SeekStorm fully control RAM access. The latter has a slight performance advantage, at the cost of a higher index load time compared to the former. https://docs.rs/seekstorm/latest/seekstorm/index/enum.Access...
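To illustrate the general trade-off in plain Rust (not SeekStorm's API; this assumes the memmap2 crate and a hypothetical index file): reading the whole file up front costs load time but keeps all access as ordinary in-memory reads, while memory-mapping opens almost instantly and lets the OS page data in on demand.

```rust
use std::fs::File;
use std::io::Read;

use memmap2::Mmap; // assumed dependency: memmap2 = "0.9"

// Fully load the index file into RAM: slower to open, but all subsequent
// access is a plain in-memory read controlled by the application.
fn load_into_ram(path: &str) -> std::io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    File::open(path)?.read_to_end(&mut buf)?;
    Ok(buf)
}

// Memory-map the index file: opening is nearly instant, but page faults and
// OS caching decide when data actually reaches RAM.
fn map_from_disk(path: &str) -> std::io::Result<Mmap> {
    let file = File::open(path)?;
    // Safety: the file must not be truncated or modified by others while mapped.
    unsafe { Mmap::map(&file) }
}

fn main() -> std::io::Result<()> {
    let in_ram = load_into_ram("index.bin")?; // hypothetical index file
    let mapped = map_from_disk("index.bin")?;
    println!("ram copy: {} bytes, mmap view: {} bytes", in_ram.len(), mapped.len());
    Ok(())
}
```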
I wonder how burnt sushi feels about this
I don't know how fair the benchmark is, but beating Tantivy by that margin is impressive to say the least.
Any plan to make it run on WASM? I wanted to add this feature to Tantivy a few years ago but they weren't interested, and I had to fall back to a JavaScript search engine that was much slower.
Developer of tantivy chiming in! (I hope that's ok) Database performance is a space where there are a lot of lies and bullshit, so you are 100% right to be suspicious.
I don't know SeekStorm's team and I did not dig much into the details, but my impression so far is that their benchmark's results are fair. At least I see no reason not to trust them.
Also we are working on some performance improvements based on the benchmark comparison, as they highlighted some areas we can improve in tantivy.
The benchmark should be reasonably fair, as it was developed by the Tantivy team themselves (and Jason Wolfe). So the choice of corpus and queries was theirs. But, of course, your mileage may vary. It is always best to benchmark on your own machine with your own data and queries.
Yes, WASM and Python bindings are on our roadmap.
How does this compare to PostgreSQL?
PostgreSQL is an SQL database that also offers full-text search (FTS); with extensions like pg_search it also supports BM25 scoring, which is essential for lexical search. SeekStorm is centered around full-text search only; it doesn't offer SQL.
Performance-wise, it would indeed be interesting to run a benchmark. The third-party open-source benchmark we are currently using (search_benchmark_game) does not yet support PostgreSQL. So yes, that comparison is still pending.
When I tried to use FTS in Postgres, I got terrible performance, but maybe I was doing something wrong. I'm using Meili now.
Same here, this would easily beat it as far as I have seen, but maybe I did something wrong.
ParadeDB (paradedb.com) is similar to this, but baked into Postgres to solve the very problem you are describing.