**High-Performance Ingestion and Storage Into 3000 Delta Tables With Spark Streaming** 🙌

**Before Databricks**

• Grammarly collects data points from its mobile and web clients across 30 million users and 50,000 teams.
• All raw data is ingested through self-managed data streams, catalogued, then sent to a pipeline for data attribution and enrichment.
• That data is combined with external integrations such as Salesforce data and text files, stored and indexed at the storage layer, then sent on for analytics. Total time from raw data to insights: a solid 4 hours.
• There is only a single entry point into the data architecture and a single output to dashboards, served through the analytics layer in Gnar.
• Grammarly built its in-house data analytics platform, Gnar, in 2015; by 2022 it was hitting scalability limits, and the volume of incoming events had outpaced its capabilities.

**Enter Databricks**

Grammarly migrated its entire data infrastructure to the Databricks Delta Lake architecture.

All Grammarly clients and backend systems now send data to the new ingestion infrastructure.

The new ingestion layer has multiple entry points from multiple sources: Kafka data streams, batch loads, change data capture (CDC), and Fivetran.
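
As an illustration of what the Kafka entry point might look like, here is a minimal PySpark sketch that reads a topic with Structured Streaming and appends the raw events to a landing Delta table. The broker address, topic name, and paths are hypothetical, not Grammarly's actual configuration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

# Read raw events from a Kafka topic as a streaming DataFrame
# (broker address and topic name are placeholders).
raw_events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")
    .option("subscribe", "client-events")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; keep the value as a JSON string
# together with the ingestion timestamp.
events = raw_events.selectExpr(
    "CAST(value AS STRING) AS raw_json",
    "timestamp AS ingested_at",
)

# Append the raw payload to a landing (bronze) Delta table; paths are made up.
(
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/client_events")
    .outputMode("append")
    .start("/mnt/bronze/events")
)
```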

**Unity Catalog** – Uniform access control over tables, columns, files, and notebooks across teams, with managed security.
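
For a rough sense of how that access control is expressed, Unity Catalog privileges are granted with SQL; a minimal sketch run from a notebook might look like this (catalog, schema, table, and group names are made up):

```python
# `spark` is the session provided in a Databricks notebook.
# All object and group names below are illustrative, not Grammarly's.
spark.sql("GRANT USE CATALOG ON CATALOG analytics TO `data-analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA analytics.events TO `data-analysts`")
spark.sql("GRANT SELECT ON TABLE analytics.events.silver_events TO `data-analysts`")
```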

**Data Storage** – Buckets for files, partitioned Parquet tables for streaming data, and an initial copy followed by incremental changes for changing data.
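
One plausible reading of "initial copy then incremental changes" is a full load followed by Delta MERGE upserts of change rows. A hedged sketch, with made-up paths and keys:

```python
from delta.tables import DeltaTable

# `spark` is the notebook-provided session.
# One-time initial copy of an external export into a Delta table
# (source and target paths are hypothetical).
(
    spark.read.parquet("/mnt/exports/accounts_full")
    .write.format("delta")
    .mode("overwrite")
    .save("/mnt/silver/accounts")
)

# Afterwards, apply incremental changes by upserting change rows on the key.
target = DeltaTable.forPath(spark, "/mnt/silver/accounts")
changes = spark.read.parquet("/mnt/exports/accounts_changes")

(
    target.alias("t")
    .merge(changes.alias("c"), "t.account_id = c.account_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```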

**Data Pipelines** – The **Databricks Medallion architecture** (raw JSON data at the Bronze layer flowing into a strictly schema-enforced Silver layer) using **Spark Structured Streaming** into 3000 **Delta Tables**. Data is partitioned across 40 different parameters to scale clusters for clients of various sizes.
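
The post does not show the pipeline code, but a minimal sketch of a Bronze-to-Silver hop of this kind, assuming a hypothetical event schema, paths, and a single illustrative partition column, could look like this:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

# Explicit schema enforced at the silver layer (fields are illustrative).
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("client", StringType()),
    StructField("event_time", TimestampType()),
])

# Read the bronze Delta table (raw JSON strings) as a stream.
bronze = spark.readStream.format("delta").load("/mnt/bronze/events")

# Parse the JSON against the schema and drop rows that fail it.
silver = (
    bronze
    .select(F.from_json(F.col("raw_json"), event_schema).alias("e"))
    .select("e.*")
    .where(F.col("user_id").isNotNull())
)

# Stream into a partitioned silver Delta table (the real pipelines partition
# across many more parameters than shown here).
(
    silver.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/silver_events")
    .partitionBy("event_type")
    .outputMode("append")
    .start("/mnt/silver/events")
)
```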

**Backfilling Data** – Yes! The team migrated the old 2015–2022 data from the legacy infrastructure into the new one. Reusing the Medallion architecture saved time migrating the old system's data into the new structure.
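
Reuse of this kind is natural because batch and Structured Streaming share the DataFrame API; a hedged sketch of a backfill that pushes historical data through the same (hypothetical) transformation as the streaming job:

```python
from pyspark.sql import functions as F

# The same parsing logic used by the streaming job, factored into a plain
# function over DataFrames (event_schema comes from the pipeline sketch above).
def transform_events(df):
    return (
        df.select(F.from_json(F.col("raw_json"), event_schema).alias("e"))
        .select("e.*")
    )

# Backfill: read the historical export as a batch DataFrame and push it
# through the same transformation into the same silver table (paths made up).
historical = spark.read.format("delta").load("/mnt/legacy/gnar_events")
(
    transform_events(historical)
    .write.format("delta")
    .mode("append")
    .partitionBy("event_type")
    .save("/mnt/silver/events")
)
```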

**Data Analytics & Visualisation** – Databricks SQL, Feature Store, and Tableau. Teams can use Python, SQL, or Scala to analyse the data.
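
As a trivial example of the kind of query analysts could then run from a notebook (table and column names are illustrative, not Grammarly's):

```python
# Daily active users per client surface over the last week.
spark.sql("""
    SELECT client,
           date_trunc('day', event_time) AS day,
           count(DISTINCT user_id)       AS daily_active_users
    FROM analytics.events.silver_events
    WHERE event_time >= current_date() - INTERVAL 7 DAYS
    GROUP BY client, date_trunc('day', event_time)
    ORDER BY day
""").show()
```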

The results are staggering – event processing time dropped from 4 hours to just 15 minutes! Up to 5 billion events are available to analytics, ready to query and deliver key insights company-wide.

If you are looking to build your own data infrastructure with Databricks, reach out today for a consultation.
