I think before you guys spend the next two years building this startup, you should carefully study the connector business and the many carcasses along the way. YC itself has a few. It is probably one of the sloggiest businesses I know, and while the success cases like Fivetran are great, there is a lot of pain behind the failures. Don't ask how I know. Good luck, and I hope you prove me wrong if you choose to ignore this.
Looks really cool! I had slight trouble understanding whether the repo is the complete codebase or if it connects to a separate backend.
Does one self-host the server, or does it connect to an Airweave backend? Put another way: how does Airweave make $$, and where does data stay at rest (if it stays anywhere)?
This is a great idea. I have a question:
Typically, the LLM is the code driving the control flow, and the MCP servers are kind of dumb API endpoints (find_flights, search_hotels, etc.) for, say, a travel MCP.
With your product, how is the LLM made aware of the underlying data store in a more useful way than “func search(query)”?
It seems to me that if you could expose some precomputed API structure through the MCP for a given data store, the LLM could reason more effectively about the data rather than throwing search queries into the void and hoping for the best.
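To make the contrast concrete, here is roughly what I mean (a minimal sketch using the MCP Python SDK's FastMCP helper; the tool names, the `schema://tables` URI, and the backing data are all made up, not your actual interface):

```python
# Minimal sketch with the MCP Python SDK (FastMCP). Everything here is
# illustrative -- made-up names and data, not Airweave's actual interface.
import json
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("datastore-demo")

FAKE_DOCS = ["database migration runbook", "Q3 marketing plan"]  # stand-in corpus

# The "void" version: the LLM only gets one opaque search tool.
@mcp.tool()
def search(query: str) -> list[str]:
    """Full-text search over the synced data store."""
    return [d for d in FAKE_DOCS if query.lower() in d.lower()]

# The alternative I'm suggesting: also expose precomputed structure
# (collections, fields, relationships) so the model can plan its queries
# instead of guessing what exists.
@mcp.resource("schema://tables")
def table_schema() -> str:
    """Describe what the data store actually contains."""
    return json.dumps({
        "tickets": {"fields": ["id", "title", "status", "assignee"]},
        "documents": {"fields": ["id", "path", "owner", "updated_at"]},
    })

if __name__ == "__main__":
    mcp.run()
```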
From what I've gathered, their main differentiator is assigning each discrete data point its own "entity" definition that is provider-independent but can be extended for each data provider.
Since everything is represented as entities, you can treat them like any other vectorized data in your vector store and use vector search.
It's a nice technique, but probably tricky if they ever venture into encapsulating endpoints in real time for rapidly changing B2C applications (rate limits / cron-job latency).
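In code, I imagine the shape is something like this (my own sketch of the idea, not their actual classes; the provider-specific subclass, the embedding function, and the vector-store call are hypothetical):

```python
# My own guess at the shape of the "entity" idea -- not Airweave's real classes.
from dataclasses import dataclass, field

@dataclass
class Entity:
    """One discrete data point, independent of any provider."""
    entity_id: str
    content: str                       # text that gets embedded
    metadata: dict = field(default_factory=dict)

@dataclass
class LinearIssueEntity(Entity):
    """Provider-specific extension: extra fields, same base contract."""
    status: str = "open"
    assignee: str | None = None

def index(entity: Entity, embed, vector_store) -> None:
    # Because every source is normalized to the same base entity, indexing
    # and search look identical regardless of provider.
    vector_store.upsert(                   # hypothetical vector-store client
        id=entity.entity_id,
        vector=embed(entity.content),      # hypothetical embedding function
        payload={**entity.metadata, "content": entity.content},
    )
```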
Is chat always the best interface for all of these apps? I feel like search is the natural first step, but chat-based search has been around for a while. An MCP-based version of Glean/Onyx/Moveworks/Dashworks is interesting, but I'm unsure how much better it makes the product. Curious to hear why your product is better.
Co-founder here. The Airweave interface doesn't discriminate between downstream use cases. Most developers don't actually build it into a chat interface at all; instead they fold it into their agents to give them access to user data. At first sight enterprise search looks quite similar, but this is really a building block for developers to set up integrations for their internal agents or agent products.
Had meetings with a ton of MCP-server providers, and no one came close to Airweave's retrieval accuracy. I even tried Zapier and similarly large companies; they didn't come near Airweave. Highly, highly recommend if you need third-party integrations for your AI agents or workflows. Love the team too: cracked, cool, kind, and always there to support their customers (they even took a customer's dog on a walk when the customer couldn't, lol).
Noob here - why would MCP providers have good accuracy?
Don't they just adapt existing APIs to the MCP protocol, basically just wrapping them?
Yes, a lot of MCP servers are just API wrappers. Airweave looks like it copies the data and runs a RAG pipeline that processes your queries.
This is exactly the reason we started building Airweave! The “context” in MCP is a bit deceiving, as it actually provides very little context.
I was looking everywhere for some solution like this. Finally! Curious, do you guys integrate with internal data sources within a company?
Pretty cool stuff. How does it deal with self-hosted data sources? Can it run inside a VPC and talk to my RDS instances directly?
You can self-host Airweave on Docker or Kubernetes within your VPC. We eventually want to move towards AWS/Azure/GCP marketplace offerings that should make this easier for you. RDS should work, as long as you run an instance with a PostgreSQL/MySQL dialect.
How about acting? Can your platform have a chat window create a ticket or send a message? I feel there is so much search already.
Used to work on AWS Bedrock Knowledge Bases — love the way y'all are thinking about this. Can't wait to try it out.
How is this different from regular MCP servers?
Co-founder here. The platform provides MCP or REST endpoints on top of searchable information. The tool is specifically geared towards agents that want to perform actions on external systems (through an MCP server, for example) but get confused about which objects to interact with. Airweave provides a robust interface for this.
You can compare it to how coding agents like Cursor work. This is the usual pattern you see:
- The first step is reading your prompt
- Then it goes through all the attached files and searches your codebase
- The last step is to make code file edits.
Non-coding agents that use "regular" MCP servers completely miss the second part. It's very hard to go from a natural language instruction to a chain of API calls that actually works and doesn't end in hallucination.
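In code, the pattern looks roughly like this (a simplified sketch; `retrieval`, `tools`, and `llm` are stand-ins, not our actual API):

```python
# Simplified sketch of the Cursor-style loop for a non-coding agent.
# `retrieval`, `tools`, and `llm` are stand-ins, not a real API.

def handle(prompt: str, retrieval, tools, llm) -> str:
    # Step 1: read the instruction.
    intent = llm.parse_intent(prompt)

    # Step 2 (the part plain MCP wrappers skip): ground the instruction
    # in the user's actual data before touching any write endpoint.
    context = retrieval.search(query=intent.query, limit=10)

    # Step 3: only now chain the API calls, with concrete object IDs
    # instead of hallucinated ones.
    plan = llm.plan_actions(intent, context)
    return tools.execute(plan)
```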
Do you really have 100+ connectors?
Looks like a great product! Do you integrate with data lakes (Snowflake/Fabric)?
Hi, co-founder here. No Snowflake or Fabric yet. We do support some popular regular SQL connectors. We are working towards an async distributed processing architecture that should allow us to process >50M-row datasets, but we're still looking for strong use-case signal here. What would you like to do with it?
Pretty cool – when does it make sense to use this vs n8n?
n8n is a good example of a tool that Airweave can enhance. n8n lets (no-code) developers set up pre-determined automations, but as soon as you want to turn non-deterministic text into action on an app, you still need a way to search that app. Example: you have an n8n workflow that keeps you on track with Linear tickets. You hook it into a text-based human interface in which the user says: "I just created a task about database migration on Linear, can you start doing the preparations for it?". Airweave can 1. find that damn ticket, 2. give additional context on database migrations based on what else it finds in the integrated systems.
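Concretely, step 1 is a search call that resolves the vague reference into a concrete ticket before the workflow acts on it. Illustrative only: the endpoint path and payload below are placeholders, not the documented API.

```python
# Illustrative only: the endpoint path and payload are placeholders,
# not the documented Airweave API.
import requests

def find_ticket(user_message: str, base_url: str, api_key: str) -> dict:
    resp = requests.post(
        f"{base_url}/search",
        headers={"Authorization": f"Bearer {api_key}"},
        json={"query": user_message, "sources": ["linear"], "limit": 1},
        timeout=30,
    )
    resp.raise_for_status()
    # e.g. the database-migration ticket, with its id and surrounding context
    return resp.json()["results"][0]

# The n8n workflow then passes the resolved ticket id to its next node
# instead of guessing which Linear issue the user meant.
```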
Nice - does it have role based access controls built in?
I assume you're talking about the data layer (not the control plane)? We are currently in the PoC phase of mapping the role graphs from source systems (Asana, Google Drive) to our internal role model, but this is still in the works. The way developers work around this atm is by configuring a connection on a subset of the source info. Example: only make Airweave sync info from the `Shared Drive/Marketing/Branding` path.
How do you handle data retention? For example, say you ingest the information of a California resident and the company is obligated by law to delete it on request. How do you ensure no derivative data exists within your model?
So you would like to delete information for a specific user identifier? Currently that means resyncing while excluding that user profile (which would have to be removed from the source system first), but happy to hear more about this use case. Would a desired feature be a "delete by user email", for example?
Looks cool! How are you thinking about pricing it?
Hi, co-founder here. Until now it's been custom deployments for customers, with additional B2B/enterprise features. We're also releasing a managed service for a flat-fee subscription.
This is so powerful - MCP on steroids :). What are the next use cases you're looking to build?
We're mostly focused on getting this right - better than any other tool atm. We are evaluating ideas like mapped RBAC, self-updating deep research, and other tools for agent builders, but it should first be very clear to us that devs actually need them :D
FYI, there is a project whose name is almost phonetically identical to yours - arweave.org
It's also the name of a mattress.
Lol, good to know. Thanks
Are integrations hooked into via their MCP implementation? Or are you hooking in more traditionally and then exposing MCP on top of that?
Also, are these one-time/event-based syncs well supported by the integration providers? I know, for instance, that Discord (and I assume others like Slack) frowns upon that sort of wholesale archival/syncing of entire chat rooms, presumably due to security concerns and to maintain their data moats.
Finally (I think), do you have to write custom "diff" logic for each integration in order to maintain up-to-date retrieval for each one? I assume it would be challenging to keep this accurate and well structured across so many different integration providers. Is there something I'm missing that makes keeping a local backup of your data easier for each service?
All in all, looks very cool. Have starred the repo to mess around with tonight.
Good questions.
1) The integrations are done traditionally, i.e. via REST/SQL. The MCP/REST search layer sits on top of the data that gets synced.
2) Most providers are painless. Slack doesn't want major exports in one go, but most developers point at a single channel anyway, so the rate-limit errors don't bite too much.
3) This is all orchestrated by the platform itself. Incremental syncs receive the latest "watermark state" and sync from there. Hashes are used to compare data for persist actions (update/insert/keep).
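Roughly, the per-entity decision looks like this (a simplified sketch of the idea, not the production implementation; the source's `fetch_changed_since` call is hypothetical):

```python
# Simplified sketch of the hash-based diff during an incremental sync;
# not the production implementation.
import hashlib

def content_hash(entity: dict) -> str:
    canonical = repr(sorted(entity.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()

def decide_action(entity: dict, stored_hashes: dict[str, str]) -> str:
    """Return 'insert', 'update', or 'keep' for one synced entity."""
    new_hash = content_hash(entity)
    old_hash = stored_hashes.get(entity["id"])
    if old_hash is None:
        return "insert"
    if old_hash != new_hash:
        return "update"
    return "keep"

def incremental_sync(source, watermark, stored_hashes):
    # Only fetch records changed since the last "watermark state".
    for entity in source.fetch_changed_since(watermark):  # hypothetical source API
        yield entity, decide_action(entity, stored_hashes)
```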
If we want to integrate our SaaS apps into Airweave, is there an app exchange or directory for doing so?
Yes, we create service accounts on the source platforms, which can then be used to do an OAuth or key-based integration. What would you like to do specifically?