DE int

 0    18 flashcards    guest3164346
Question English Answer English
ETL (Extract, Transform, Load)
A process where data is extracted from a source, transformed (e.g. cleaned or aggregated), and then loaded into a database or data warehouse.
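The three ETL stages can be sketched with the standard library alone. The sample CSV, the `user_totals` table, and the function names are illustrative assumptions, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would come from a file or an API.
RAW_CSV = "user_id,amount\n1, 10.50\n2,\n1,4.50\n"

def extract(text):
    """Extract: read rows from the CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows without an amount, cast types, total per user."""
    totals = {}
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # cleaning step: skip incomplete rows
        user_id = int(row["user_id"])
        totals[user_id] = totals.get(user_id, 0.0) + float(amount)
    return sorted(totals.items())

def load(rows, conn):
    """Load: write the already transformed rows into the destination table."""
    conn.execute("CREATE TABLE user_totals (user_id INTEGER, total REAL)")
    conn.executemany("INSERT INTO user_totals VALUES (?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
etl_result = conn.execute(
    "SELECT user_id, total FROM user_totals ORDER BY user_id"
).fetchall()
print(etl_result)
```

Only cleaned, aggregated rows ever reach the destination; the raw rows are discarded along the way.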
ELT (Extract, Load, Transform)
Raw data is first loaded into the destination (like BigQuery), and then transformed using SQL or other tools inside the warehouse.
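A minimal sketch of the ELT order of operations, assuming an in-memory SQLite database as a stand-in for a warehouse such as BigQuery. The table names `raw_orders` and `user_totals` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load: store the raw values untouched in a staging table.
conn.execute("CREATE TABLE raw_orders (user_id TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [("1", "10.50"), ("2", ""), ("1", "4.50")],
)

# Transform: clean and aggregate with SQL inside the destination itself.
conn.execute(
    """
    CREATE TABLE user_totals AS
    SELECT CAST(user_id AS INTEGER) AS user_id,
           SUM(CAST(amount AS REAL)) AS total
    FROM raw_orders
    WHERE amount <> ''
    GROUP BY CAST(user_id AS INTEGER)
    """
)
elt_result = conn.execute(
    "SELECT user_id, total FROM user_totals ORDER BY user_id"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
print(elt_result, raw_count)
```

The raw table is still available after the transformation, so the data can be re-transformed later with different rules.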
DAG (Directed Acyclic Graph – Airflow)
A structure used in Airflow to define workflows. Tasks are the nodes and dependencies are the directed edges; because the graph contains no cycles, the tasks can always be run in an order that respects every dependency.
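The ordering idea behind a DAG can be illustrated without Airflow, using `graphlib` from the Python standard library (Python 3.9+). This is not Airflow code; the task names and the `has_cycle` helper are made up for the example.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a task; the set holds the tasks it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# A valid run order exists because the graph has no cycles.
run_order = list(TopologicalSorter(dependencies).static_order())

def has_cycle(graph):
    """Return True if the dependency graph contains a cycle."""
    try:
        TopologicalSorter(graph).prepare()
    except CycleError:
        return True
    return False

print(run_order)
print(has_cycle({"a": {"b"}, "b": {"a"}}))
```

A graph where `a` waits for `b` and `b` waits for `a` has no valid run order, which is why workflows must be acyclic.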
Partitioning (BigQuery)
Dividing a large table into parts (usually by date) to make queries faster and cheaper by scanning only relevant partitions.
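The effect of partition pruning can be modelled in plain Python. This is only an in-memory analogy, not how BigQuery is implemented; the events and the `query_users` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical events: (event_date, user_id)
events = [
    ("2024-01-01", 1),
    ("2024-01-01", 2),
    ("2024-01-02", 3),
    ("2024-01-03", 4),
]

# "Partition" the table: one bucket of rows per date.
partitions = defaultdict(list)
for event_date, user_id in events:
    partitions[event_date].append(user_id)

def query_users(date):
    """Return the users for one date and how many rows had to be scanned."""
    rows = partitions.get(date, [])
    return rows, len(rows)

users, rows_scanned = query_users("2024-01-01")
print(users, rows_scanned, len(events))
```

A query for a single date touches 2 rows instead of all 4, which is the source of the speed and cost savings.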
JOIN (SQL)
A way to combine data from two or more tables based on a related column (e.g. user_id).
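A small sketch of joining on `user_id`, run against an in-memory SQLite database. The `users` and `orders` tables and their rows are invented for the example; it also shows how an inner join and a LEFT JOIN differ for a user with no orders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (user_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, user_id INTEGER, amount REAL);
    INSERT INTO users VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO orders VALUES (10, 1, 9.5), (11, 1, 20.0), (12, 3, 5.0);
    """
)

# Inner join: only rows with a match in both tables.
inner_rows = conn.execute(
    """
    SELECT u.name, o.order_id
    FROM users u
    JOIN orders o ON o.user_id = u.user_id
    ORDER BY o.order_id
    """
).fetchall()

# Left join: every user, with NULL where no order matches.
left_rows = conn.execute(
    """
    SELECT u.name, o.order_id
    FROM users u
    LEFT JOIN orders o ON o.user_id = u.user_id
    ORDER BY u.user_id, o.order_id
    """
).fetchall()
print(inner_rows)
print(left_rows)
```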
HAVING vs WHERE (SQL)
WHERE filters rows before aggregation; HAVING filters after. Example: HAVING COUNT(*) > 100.
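Both filters can be seen in one query. The sketch below uses an in-memory SQLite database with a made-up `orders` table and a smaller threshold than the card's example so the data stays short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (user_id INTEGER, amount REAL);
    INSERT INTO orders VALUES
        (1, 5.0), (1, 50.0), (1, 60.0),
        (2, 70.0), (2, 1.0),
        (3, 80.0);
    """
)

rows = conn.execute(
    """
    SELECT user_id, COUNT(*) AS big_orders
    FROM orders
    WHERE amount > 10        -- row filter, applied before grouping
    GROUP BY user_id
    HAVING COUNT(*) > 1      -- group filter, applied after aggregation
    ORDER BY user_id
    """
).fetchall()
print(rows)
```

WHERE first discards the small orders; HAVING then keeps only users who still have more than one order left.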
PySpark
Python API for Apache Spark. It’s used to process very large datasets in a distributed, parallelized way.
BigQuery
A serverless cloud data warehouse from Google, designed for running fast SQL queries on large datasets.
Data Lake
A storage system that keeps raw data in its original format, whether structured, semi-structured, or unstructured; often used for flexible analytics or staging.
Data Warehouse
A structured database optimized for analysis and reporting, typically holding cleaned and transformed data.
Airflow Operator
A template for a unit of work in an Airflow DAG – defines what a task does when it runs (e.g. PythonOperator, BashOperator).
Kafka Topic
A named data stream in Apache Kafka where producers send and consumers receive messages.
IAM (Identity and Access Management – GCP)
A system for managing permissions and access to resources in Google Cloud – defines who can do what.
KPI (Key Performance Indicator)
A measurable value that shows how effectively a process or business is performing (e.g. conversion rate, average delay).
Lazy Evaluation (Spark)
Transformations are not executed until an action (like .count() or .collect()) is called – this lets Spark plan and optimize the whole chain of transformations before running anything.
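Python generators behave in a similar way and give a rough analogy that runs without a Spark cluster. This is not Spark code: `double` and `keep_large` are hypothetical stand-ins for transformations, and `list(...)` plays the role of an action.

```python
log = []

def double(values):
    """A 'transformation': records each element it actually processes."""
    for v in values:
        log.append(f"double({v})")
        yield v * 2

def keep_large(values, threshold):
    """Another 'transformation': keeps values above the threshold."""
    for v in values:
        if v > threshold:
            yield v

# Building the pipeline runs nothing yet, like chaining transformations.
pipeline = keep_large(double([1, 2, 3]), threshold=2)
calls_before_action = len(log)

# Consuming the pipeline is the "action" that triggers the work.
result = list(pipeline)
calls_after_action = len(log)
print(calls_before_action, result, calls_after_action)
```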
Retry (Airflow)
A setting that allows a task to be automatically retried after failure, helpful for unstable operations.
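The retry behaviour can be sketched in plain Python. This is an illustration of the idea, not Airflow's implementation: the parameter names echo Airflow's `retries` and `retry_delay` task arguments, but here the delay is a plain number of seconds, and `run_with_retries` and `flaky_task` are hypothetical.

```python
import time

def run_with_retries(task, retries=3, retry_delay=0.0):
    """Call task(); after a failure, retry up to `retries` more times."""
    attempt = 0
    while True:
        try:
            return task()
        except Exception:
            if attempt >= retries:
                raise  # retries exhausted: surface the last error
            attempt += 1
            time.sleep(retry_delay)

calls = {"count": 0}

def flaky_task():
    """Fails on the first two calls, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

outcome = run_with_retries(flaky_task, retries=3)
print(outcome, calls["count"])
```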
Data Validation
The process of ensuring that data is accurate and consistent – includes checking for missing values, duplicates, or wrong formats.
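The three checks named above (missing values, duplicates, wrong formats) might look like this in plain Python. The records, field names, and the expected YYYY-MM-DD date format are assumptions made for the example.

```python
import re

# Assumed date format for this example: YYYY-MM-DD.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

records = [
    {"id": 1, "email": "a@example.com", "signup_date": "2024-01-05"},
    {"id": 2, "email": None, "signup_date": "2024-01-06"},
    {"id": 1, "email": "a@example.com", "signup_date": "2024-01-05"},
    {"id": 3, "email": "c@example.com", "signup_date": "05/01/2024"},
]

def validate(rows):
    """Return a list of (row_index, problem) pairs."""
    issues = []
    seen_ids = set()
    for index, row in enumerate(rows):
        if row["email"] is None:
            issues.append((index, "missing email"))
        if row["id"] in seen_ids:
            issues.append((index, "duplicate id"))
        seen_ids.add(row["id"])
        if not DATE_RE.match(row["signup_date"]):
            issues.append((index, "bad date format"))
    return issues

issues = validate(records)
print(issues)
```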
Window Function (SQL)
A function that performs calculations across a "window" of rows related to the current row, without collapsing them into a single result (e.g. ROW_NUMBER(), AVG(...) OVER(...)).
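A short sketch of both functions mentioned above, assuming the SQLite bundled with Python is version 3.25 or newer (the first release with window functions). The `orders` table is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (user_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 10.0), (1, 30.0), (2, 5.0);
    """
)

rows = conn.execute(
    """
    SELECT user_id,
           amount,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY amount) AS rn,
           AVG(amount) OVER (PARTITION BY user_id) AS user_avg
    FROM orders
    ORDER BY user_id, amount
    """
).fetchall()
print(rows)
```

Every input row is still present in the output, each carrying its rank and its user's average; a GROUP BY would have collapsed them to one row per user.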