ai airbyte data data-engineering engineering llm rag
2025-01-14 - Originally posted at https://airbyte.com/blog/permissions-for-ai-use-cases
[Figure: our visualization for bringing permissions to your AI application.]
When building AI pipelines, syncing who has access to the content is almost as important as syncing the content itself. In traditional data warehouse work, access to the tables in the warehouse is controlled by humans on the data/analytics team. If the viewer's role is appropriate, they can view the table - e.g. the finance team can view the "purchases" table, but not the full "users" table, which contains PII.

However, AI applications will commonly work against datasets of content, not just data - including text, videos, etc. This content is guaranteed to contain PII, sensitive business or financial information, and other "private" information. To that end, we can't use the same human-controlled permission model as before, granting all users the same access to the "Google Drive Documents" table - each user's individual roles and access need to be preserved. Furthermore, when building secure AI applications, it is imperative that the context provided to the LLM only includes content that both the machine and the end user are allowed to see - relying on the AI itself to guard sensitive information has been regularly shown to be a flawed approach. To this end, when thinking about AI applications, a multi-stage permission model will be needed:

1. At sync time, ingest identity and permission metadata alongside the content itself, and optionally apply coarse filters to keep the most sensitive material out of the collection entirely.
2. At query time, filter everything handed to the LLM (and returned to the user) down to what that specific user is allowed to access.
Not every source provides the ability to extract identities, roles, and permissions at the level we would like. This is why a 2-phase approach is needed: it allows some level of access control even for the least capable sources.
The most secure way to check access would be to query the system of record (e.g. the Google Docs APIs) at read time to confirm that the user still has access to the original document. This is a great approach when the user is requesting access to an individual resource (and the system of record is both performant and has high uptime). However, AI use cases operate on either many resources (RAG) or aggregates (text-to-SQL), so the latency of looking up every possible resource would be too high, and the lookups would likely be rate-limited. Therefore, we need to cache this information ahead of time to make it available at query time. This caching introduces a few unavoidable drawbacks, notably permission lag (you might still be able to see an item you should no longer see) and flattening (the nuance of the groups and ACLs in the source system is lost and made more coarse).
- To mitigate the lag issue, we allow users to sync the permission stream more often than the content itself.
- To mitigate the flattening issue, the identity streams should not be used as a system of record, but rather as metadata attached to the data stream.
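To make this concrete, here is a minimal sketch of a query-time check that reads only the cached permission metadata (using the stream properties defined below) and never calls the source API. The deny-wins rule here is an assumption for illustration, not a prescribed policy:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """A synced record carrying cached (possibly stale) permission metadata."""
    content: str
    allowed_identity_remote_ids: list[str] = field(default_factory=list)
    denied_identity_remote_ids: list[str] = field(default_factory=list)
    publicly_accessible: bool = False

def can_access(doc: Document, user_identity_ids: set[str]) -> bool:
    """Purely local check: no call back to the system of record at query time.
    Assumption: an explicit deny wins over any allow."""
    if user_identity_ids & set(doc.denied_identity_remote_ids):
        return False
    return doc.publicly_accessible or bool(
        user_identity_ids & set(doc.allowed_identity_remote_ids)
    )
```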
Consider a Google Drive source. We will want to ingest all the presentations and documents available for our application, and maintain which of those documents each user has access to. Google Drive does provide APIs to load all of this information, so we can produce 2 streams of data:

- Files: the documents themselves, each annotated with who may access it
- Identities: the users and groups in the workspace, with group memberships unrolled into email addresses
The best sources will provide both of these streams. Each stream can be synced at a different frequency, and will likely use a different sync mode as well - Files will likely be incremental, while Identities will likely be full-refresh.
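As an illustration only (this is not actual Airbyte configuration syntax), the shape of such a setup might look like:

```python
# Illustrative only -- not real configuration syntax for any product.
# The point: permissions can sync on a tighter cadence than content,
# and the two streams can use different sync modes.
streams = [
    {
        "name": "files",
        "sync_mode": "incremental",   # only pick up new/changed documents
        "schedule": "every 24 hours",
    },
    {
        "name": "identities",
        "sync_mode": "full_refresh",  # re-snapshot the whole ACL picture
        "schedule": "every 1 hour",   # tighter cadence shrinks permission lag
    },
]
```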
All File streams will gain the following properties:
- allowed_identity_remote_ids (list[str])
- denied_identity_remote_ids (list[str])
- publicly_accessible (bool)

Unrolling group memberships is the job of the source, and the schema of the Identities stream will at least include:
- remote_id (str)
- email_address (str)
- member_email_addresses (list[str])
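Putting the two schemas together, a record from each stream might look like this (all values are invented for illustration):

```python
# Hypothetical example records -- field values are invented.
file_record = {
    "id": "doc_123",
    "name": "Q3 Planning.gdoc",
    "allowed_identity_remote_ids": ["user_1", "group_7"],
    "denied_identity_remote_ids": [],
    "publicly_accessible": False,
}

identity_record = {
    "remote_id": "group_7",
    "email_address": "finance-team@example.com",
    # Group memberships already unrolled by the source:
    "member_email_addresses": ["alice@example.com", "bob@example.com"],
}
```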
As an administrator of the ingestion pipeline, you may wish to use filters as a coarse way of adding role information to the dataset. For example, if your incoming dataset from Google Drive includes the original file paths of the documents, you may want to exclude any documents in the "exec" folder, as they are likely to be too sensitive. Or, you may want to split your dataset into "EU" and "USA" documents based on other pieces of metadata (e.g. folder name or group memberships), to provide limited or different information to different groups of users. This filtering step is especially useful when the source is not able to provide complete Identity and Role information - filtering can act as a stop-gap.
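A minimal sketch of what such coarse, metadata-driven filtering could look like, assuming a hypothetical path field on each incoming record:

```python
# Coarse ingestion-time filtering -- `path` is a hypothetical metadata
# field; real filters would key off whatever metadata your source emits.
records = [
    {"id": "doc_1", "path": "/exec/board-minutes.gdoc"},
    {"id": "doc_2", "path": "/eu/gdpr-policy.gdoc"},
    {"id": "doc_3", "path": "/eng/design-doc.gdoc"},
]

# Exclude documents likely to be too sensitive for the collection at all.
visible = [r for r in records if not r["path"].startswith("/exec/")]

# Route the remainder into regional collections for different user groups.
eu_docs = [r for r in visible if r["path"].startswith("/eu/")]
usa_docs = [r for r in visible if not r["path"].startswith("/eu/")]
```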
If the source is able to provide Identity and Role information, we need to join the streams together. This is done by way of a "mapping": joining (in the SQL sense) all the user and role information onto the original content record so that we have an easy way to query who can access each item. This duplicates data (adding storage cost) in exchange for faster runtime lookups (much like an index). We have decided to use email addresses as the shared unit of identity across all sources.
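A rough sketch of that mapping step, with field names following the schemas above (subtracting denied identities from the allowed set is one possible policy, assumed here for illustration):

```python
def build_acl_index(files: list[dict], identities: list[dict]) -> list[dict]:
    """Denormalize identities onto each file record (like building an index):
    storage is duplicated so that query-time lookups are a single read."""
    # remote_id -> every email address that identity resolves to
    emails_by_id: dict[str, set[str]] = {}
    for ident in identities:
        emails_by_id[ident["remote_id"]] = {
            ident["email_address"],
            *ident.get("member_email_addresses", []),
        }

    indexed = []
    for f in files:
        allowed: set[str] = set()
        for rid in f.get("allowed_identity_remote_ids", []):
            allowed |= emails_by_id.get(rid, set())
        for rid in f.get("denied_identity_remote_ids", []):
            allowed -= emails_by_id.get(rid, set())
        indexed.append({**f, "allowed_email_addresses": sorted(allowed)})
    return indexed
```

Because the allowed email addresses are denormalized onto each document, the query path never needs to touch the Identities stream at all.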
It is finally time to use our data in an AI application. As the application/agent developer, you can call the RAG / chat completion / search / aggregation APIs with or without a user's email address. If an email address is provided, everything returned to you and to the LLM is further filtered down to what that user is allowed to access, on top of the filtering applied to the context collection itself. Otherwise, the caller has access to all the content in the (filtered) collection:
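For instance, a retrieval call applying the cached ACLs might look like the following sketch (the function and its parameters are illustrative, not a specific product API; it reuses the allowed_email_addresses field built by the mapping step above):

```python
def search(collection: list[dict], query: str,
           user_email: str | None = None) -> list[dict]:
    """Return only documents this user may see; with no email address,
    the whole (already filtered) collection is eligible."""
    eligible = [
        doc for doc in collection
        if user_email is None
        or doc.get("publicly_accessible")
        or user_email in doc.get("allowed_email_addresses", [])
    ]
    # ...rank `eligible` against `query` and pass the top hits to the LLM...
    return eligible
```

Because the ACL check happens before anything reaches the model, the LLM never sees content the end user was not entitled to in the first place.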